bcgsc / mavis

Merging, Annotation, Validation, and Illustration of Structural variants
http://mavis.bcgsc.ca
GNU General Public License v3.0
72 stars 13 forks source link

Help Using Limited Dataset #190

Closed MDisyak closed 4 years ago

MDisyak commented 5 years ago

I am attempting to generate a config file so that I can run annotate and validate on the data I have generating through our new rapid fusion detection tool. When running the config command I have come across some errors that I can't figure out, all of which pertain to the calculation of the median insert size (I think).

TLDR

Production bam with limited annotations -> list index out of range error My bam with limited annotation -> No median for empty data My bam with production annotation -> No median for empty data Production bam with full annotations and my input file -> Success (I didn't put this test in detail below)

This suggests there's an issue with my bam and also with my annotation file, but not with my input file.

END TLDR

These are the attempts I have made:

COMMAND 1 For this run, I used the standard (full) annotation file given in the source file: source /gsc/pipelines/mavis/env/production

Followed by: mavis config --library test transcriptome diseased True /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/alignment_results.sorted.bam --assign test /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt --log mavis_log.log --log_level DEBUG -w /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis.cfg --skip_stage cluster

I also attempted this command with the '--input' flag and gave it my mavis_input.txt file. Same error.

Note that the data is not empty but contains a small set of aligned reads.

ERROR 1

  File "/home/mdisyak/git/mavis/venv/bin/mavis", line 11, in <module>
    load_entry_point('mavis==2.2.1', 'console_scripts', 'mavis')()
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 414, in main
    raise err
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 383, in main
    _config.generate_config(args, parser, log=_util.LOG)
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 616, in generate_config
    distribution_fraction=args.distribution_fraction
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 150, in build
    distribution_fraction=distribution_fraction
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py", line 189, in compute_transcriptome_bam_stats
    read_length = stats.median(read_lengths)
  File "/projects/tumour_char/analysis_scripts/python/centos06/python-3.6.0/lib/python3.6/statistics.py", line 380, in median
    raise StatisticsError("no median for empty data")
statistics.StatisticsError: no median for empty data

Next attempt I generated a smaller annotation file using only annotations for EWSR1 (the gene that most of the reads in the bam file align to):

COMMAND 2 Using my custom source file (I just copied the production mavis one and changed the annotation to my EWSR1 annotation file):

source /home/mdisyak/ewsr1_annot mavis config --library test transcriptome diseased True /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/alignment_results.sorted.bam --assign test /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt --log mavis_log.log --log_level DEBUG -w /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis.cfg --skip_stage cluster --input /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt test

ERROR 2

/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py:134: UserWarning: insufficient annotations to match requested sample size. requested 500, but only 1 annotations
  sample_size, len(total_annotations)))
Traceback (most recent call last):
  File "/home/mdisyak/git/mavis/venv/bin/mavis", line 11, in <module>
    load_entry_point('mavis==2.2.1', 'console_scripts', 'mavis')()
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 414, in main
    raise err
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 383, in main
    _config.generate_config(args, parser, log=_util.LOG)
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 616, in generate_config
    distribution_fraction=args.distribution_fraction
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 150, in build
    distribution_fraction=distribution_fraction
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py", line 189, in compute_transcriptome_bam_stats
    read_length = stats.median(read_lengths)
  File "/projects/tumour_char/analysis_scripts/python/centos06/python-3.6.0/lib/python3.6/statistics.py", line 380, in median
    raise StatisticsError("no median for empty data")
statistics.StatisticsError: no median for empty data
a

COMMAND 3 Same as above but now specifying sample size:

mavis config --library test transcriptome diseased True /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/alignment_results.sorted.bam --assign test /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt --log mavis_log.log --log_level DEBUG -w /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis.cfg --skip_stage cluster --transcriptome_bins 1

ERROR 3

/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py:134: UserWarning: insufficient annotations to match requested sample size. requested 500, but only 2 annotations
  sample_size, len(total_annotations)))
Traceback (most recent call last):
  File "/home/mdisyak/git/mavis/venv/bin/mavis", line 11, in <module>
    load_entry_point('mavis==2.2.1', 'console_scripts', 'mavis')()
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 414, in main
    raise err
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 383, in main
    _config.generate_config(args, parser, log=_util.LOG)
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 616, in generate_config
    distribution_fraction=args.distribution_fraction
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 150, in build
    distribution_fraction=distribution_fraction
  File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py", line 189, in compute_transcriptome_bam_stats
    read_length = stats.median(read_lengths)
  File "/projects/tumour_char/analysis_scripts/python/centos06/python-3.6.0/lib/python3.6/statistics.py", line 380, in median
    raise StatisticsError("no median for empty data")
statistics.StatisticsError: no median for empty data

I tried the above command with 2, 3, and 4 annotations in the annotation file. Each time I received the "no median for empty data" error.

To see if this was just an issue with my data I used a production bam:

COMMAND 4 This annotation file I used in this case has 2 annotations (EWSR1 and FLI1):

mavis config --library test transcriptome diseased True /projects/analysis/analysis32/P02965/merge37799_bwa-mem-0.7.6a-sb/75bp/hg19a_jg-e69/P02965_12_lanes_dupsFlagged.bam --assign test /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt --log mavis_log.log --log_level DEBUG -w /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis.cfg --skip_stage cluster --transcriptome_bins 2

ERROR 4

Traceback (most recent call last): File "/home/mdisyak/git/mavis/venv/bin/mavis", line 11, in <module> load_entry_point('mavis==2.2.1', 'console_scripts', 'mavis')() File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 414, in main raise err File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 383, in main _config.generate_config(args, parser, log=_util.LOG) File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 616, in generate_config distribution_fraction=args.distribution_fraction File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 150, in build distribution_fraction=distribution_fraction File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py", line 193, in compute_transcriptome_bam_stats median = result.median() File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py", line 64, in median return (values[low_center - 1] + values[high_center - 1]) / 2 IndexError: list index out of range

So clearly, there's an issue with my data and my annotation file. I noticed there was some data in my bam file that was missing any positional data and a cigarstring, so I removed that data. Then I reran COMMAND 4

COMMAND 5 Run with 2 annotations in the annotation file.

mavis config --library test transcriptome diseased True ~/trimmed_pos_sorted.bam --assign test /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt --log mavis_log.log --log_level DEBUG -w /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis.cfg --skip_stage cluster --transcriptome_bins 1

ERROR 5

File "/home/mdisyak/git/mavis/venv/bin/mavis", line 11, in <module> load_entry_point('mavis==2.2.1', 'console_scripts', 'mavis')() File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 414, in main raise err File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/main.py", line 383, in main _config.generate_config(args, parser, log=_util.LOG) File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 616, in generate_config distribution_fraction=args.distribution_fraction File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/config.py", line 150, in build distribution_fraction=distribution_fraction File "/home/mdisyak/git/mavis/venv/lib/python3.6/site-packages/mavis/bam/stats.py", line 189, in compute_transcriptome_bam_stats read_length = stats.median(read_lengths) File "/projects/tumour_char/analysis_scripts/python/centos06/python-3.6.0/lib/python3.6/statistics.py", line 380, in median raise StatisticsError("no median for empty data") statistics.StatisticsError: no median for empty data

Any insight you can give me would be great! I think it's probably something simple I'm just not understanding about how Mavis works or perhaps the way I generated my bam file.

All relevant files are listed below:

Input file: /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt

Bam file: /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/alignment_results.sorted.bam

Production Bam /projects/analysis/analysis32/P02965/merge37799_bwa-mem-0.7.6a-sb/75bp/hg19a_jg-e69/P02965_12_lanes_dupsFlagged.bam

Custom annotation file: /home/mdisyak/ewsr1.json

Source file referencing the above annotation file: /home/mdisyak/ewsr1_annot

MDisyak commented 5 years ago

The original issue has been resolved. Mavis can't handle only weird reads as it can't calculate the fragment size and std dev properly. As discussed, there is an issue with selecting transcriptome_bins of a size equal to the number of annotations. For example, if there are only 2 annotations, Mavis will only accept a transcriptome_bin size of 1.

Example:

mavis config --library test transcriptome diseased True /projects/analysis/analysis29/P02414/merge21428_bwa-mem-0.7.6a-sb/75nt/hg19a_jg-e69/P02414_8_lanes_dupsFlagged.bam --assign test /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt -w /projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis.cfg --skip_stage cluster --annotations ~/ewsr1.json

Error Message MAVIS: 2.2.4 hostname: gphost06.bcgsc.ca [2019-05-08 11:28:32] arguments add_defaults = False annotations = ['/home/mdisyak/ewsr1.json'] assign = [['test', '/projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis_input.txt']] command = 'config' convert = [] distribution_fraction = 0.97 external_conversion = [] genome_bins = 100 input = [] library = [['test', 'transcriptome', 'diseased', 'True', '/projects/analysis/analysis29/P02414/merge21428_bwa-mem-0.7.6a-sb/75nt/hg19a_jg-e69/P02414_8_lanes_dupsFlagged.bam']] log = None log_level = 'INFO' skip_stage = ['cluster'] transcriptome_bins = 500 write = '/projects/trans_scratch/validations/workspace/mdisyak/apr23_rapid/alignment_dir/mavis.cfg' generating the config section for: test [2019-05-08 11:28:32] loading: ['/home/mdisyak/ewsr1.json'] /home/mdisyak/git/mavis/mavis/mavis/bam/stats.py:150: UserWarning: insufficient annotations to match requested sample size. requested 500, but only 2 annotations sample_size, len(total_annotations) Traceback (most recent call last): File "/home/mdisyak/git/venv_mavis_dev/bin/mavis", line 11, in <module> load_entry_point('mavis', 'console_scripts', 'mavis')() File "/home/mdisyak/git/mavis/mavis/mavis/main.py", line 414, in main raise err File "/home/mdisyak/git/mavis/mavis/mavis/main.py", line 383, in main _config.generate_config(args, parser, log=_util.LOG) File "/home/mdisyak/git/mavis/mavis/mavis/config.py", line 616, in generate_config distribution_fraction=args.distribution_fraction File "/home/mdisyak/git/mavis/mavis/mavis/config.py", line 150, in build distribution_fraction=distribution_fraction File "/home/mdisyak/git/mavis/mavis/mavis/bam/stats.py", line 219, in compute_transcriptome_bam_stats err = result.distribution_stderr(median, distribution_fraction) File "/home/mdisyak/git/mavis/mavis/mavis/bam/stats.py", line 91, in distribution_stderr return sum(values[0:end]) / end ZeroDivisionError: division by zero