HadrienG / InSilicoSeq

:rocket: A sequencing simulator
https://insilicoseq.readthedocs.io
MIT License
185 stars 32 forks source link

create error model (IndexError) #89

Closed antingH closed 4 years ago

antingH commented 5 years ago

Encountered errors when running iss model:

DEBUG:iss.app:iss version 1.3.3 DEBUG:iss.app:Using verbose logger INFO:iss.app:Starting iss model INFO:iss.app:Using KDE ErrorModel INFO:iss.bam:Reading bam file: sortedRef.bam INFO:iss.bam:Calculating insert size distribution /home/anting/.local/lib/python3.6/site-packages/numpy/core/_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice keepdims=keepdims) /home/anting/.local/lib/python3.6/site-packages/numpy/core/_methods.py:110: RuntimeWarning: invalid value encountered in true_divide arrmean, rcount, out=arrmean, casting='unsafe', subok=False) /home/anting/.local/lib/python3.6/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) Traceback (most recent call last): File "/home/anting/.local/bin/iss", line 11, in sys.exit(main()) File "/home/anting/.local/lib/python3.6/site-packages/iss/app.py", line 489, in main args.func(args) File "/home/anting/.local/lib/python3.6/site-packages/iss/app.py", line 263, in model_from_bam bam.to_model(args.bam, args.output) File "/home/anting/.local/lib/python3.6/site-packages/iss/bam.py", line 189, in to_model hist_insert_size = modeller.insert_size(insert_size_dist) File "/home/anting/.local/lib/python3.6/site-packages/iss/modeller.py", line 28, in insert_size bw_method=0.2 / np.std(insert_size_distribution, ddof=1)) File "/home/anting/.local/lib/python3.6/site-packages/scipy/stats/kde.py", line 195, in init raise ValueError("dataset input should have multiple elements.") ValueError: dataset input should have multiple elements.

HadrienG commented 5 years ago

Hi!

Thank you for reporting this. Are you using single-end reads? Another possibility is that all your reads have the same insert size.

Is it possible for you to share your bam file, or your input reads + reference genome? That'd greatly help.

Best, Hadrien

antingH commented 5 years ago

Hi Hadrien, the bam files are at https://drive.google.com/file/d/1PhGeyhw6zo77pmy8SjoibfD8hyIUcRVL/view?usp=sharing Thanks for working on this!

HadrienG commented 5 years ago

Hi!

I had a look at you bam file, and it appears that the reads are unpaired:

(genomics) hadrien:~/Downloads/bam$ samtools flagstat sortedRef.bam
44204316 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
42463322 + 0 mapped (96.06% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
(genomics) hadrien:~/Downloads/bam$ samtools view sortedRef.bam | cut -f9 | sort | uniq 
0

as a result InSilicoSeq fails to calculate the insert size distribution. I had not anticipated single-end Illumina data since I though it was pretty uncommon.

Since currently it is only possible to generate paired-end dataset, I advise you to build your model using paired-end data. I will think about adding single-end support in the next release though!

Best, Hadrien

antingH commented 5 years ago

Hi Hadrien, I ran iss model on a paired-end dataset, this time getting a different error message:

DEBUG:iss.app:iss version 1.3.3 DEBUG:iss.app:Using verbose logger INFO:iss.app:Starting iss model INFO:iss.app:Using KDE ErrorModel INFO:iss.bam:Reading bam file: sortedRef.bam Traceback (most recent call last):0000 reads File "/home/anting/.local/bin/iss", line 11, in sys.exit(main()) File "/home/anting/.local/lib/python3.6/site-packages/iss/app.py", line 489, in main args.func(args) File "/home/anting/.local/lib/python3.6/site-packages/iss/app.py", line 263, in model_from_bam bam.to_model(args.bam, args.output) File "/home/anting/.local/lib/python3.6/site-packages/iss/bam.py", line 177, in to_model subst_matrix_f[pos, subst] += 1 IndexError: index 301 is out of bounds for axis 0 with size 301

The bam files can be found here: https://drive.google.com/file/d/1UmeHdW5tAoeeXkyqBQylJ3k0RFc_zCus/view?usp=sharing Thanks so much!

standage commented 5 years ago

It looks like the dimensions of the substitution matrix are hard-coded to 301x16 at the beginning of the to_model function. Presumably this limits the read lengths to 2x150? Or is it 2x300? In any case, the BAM file referenced above contains numerous read fragments longer than 300bp—maximum fragment length is 491bp.

HadrienG commented 5 years ago

Hi,

First of all sorry for the delay, I haven't been able to take time to look at this in detail.

@standage thanks for taking a look! You are right that the matrices are initialised with a length of 301. They are later resized to the longest reads present in the dataset (100, or 150 per example).

This indeed currently limits InSilicoSeq to 300 base pair fragments, which to my knowledge is the longest illumina reads on the market (MiSeq).

@antingH could you tell us more about your input bam file? From which sequencing instrument is it from?

standage commented 5 years ago

I presume that setting the default to something like 501 would have performance (runtime and memory) consequences. Would there be any other consequences (i.e. wonky results)?

In the referenced data set, for reads with no indels or indel errors (i.e., cigar =~ m/^\d+M$/), 2.5% of the read fragments are > 300bp in length. Fewer than 0.04% are longer than 350bp in length.

Some combination of relaxing this constraint vs. trimming/discarding long reads might allow this analysis to move forward.

HadrienG commented 5 years ago

Initialising the arrays at 501 would add roughly 100Kb of memory usage 😛

There would not be other consequences for "normal-sized" datasets as far as I can anticipate. My concern is that Illumina sequencing works in cycles, and reads should not be longer than 301 base pairs.

I thing discarding the longer reads would be the better fix here, and maybe add a warning saying that it's weird Illumina data.

standage commented 5 years ago

Does InSilicoSeq work only on Illumina data?

HadrienG commented 5 years ago

For the moment yes. We have plans to simulate long reads (mostly nanopore) but not in the short-term.

standage commented 5 years ago

I'm pretty sure these are Ion Torrent reads.

HadrienG commented 5 years ago

In that case I'd advise against using insilicoseq as it will not accurately model homopolymers errors