Closed antingH closed 4 years ago
Hi!
Thank you for reporting this. Are you using single-end reads? Another possibility is that all your reads have the same insert size.
Is it possible for you to share your bam file, or your input reads + reference genome? That'd greatly help.
Best, Hadrien
Hi Hadrien, the bam files are at https://drive.google.com/file/d/1PhGeyhw6zo77pmy8SjoibfD8hyIUcRVL/view?usp=sharing Thanks for working on this!
Hi!
I had a look at you bam file, and it appears that the reads are unpaired:
(genomics) hadrien:~/Downloads/bam$ samtools flagstat sortedRef.bam
44204316 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
42463322 + 0 mapped (96.06% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
(genomics) hadrien:~/Downloads/bam$ samtools view sortedRef.bam | cut -f9 | sort | uniq
0
as a result InSilicoSeq fails to calculate the insert size distribution. I had not anticipated single-end Illumina data since I though it was pretty uncommon.
Since currently it is only possible to generate paired-end dataset, I advise you to build your model using paired-end data. I will think about adding single-end support in the next release though!
Best, Hadrien
Hi Hadrien, I ran iss model on a paired-end dataset, this time getting a different error message:
DEBUG:iss.app:iss version 1.3.3
DEBUG:iss.app:Using verbose logger
INFO:iss.app:Starting iss model
INFO:iss.app:Using KDE ErrorModel
INFO:iss.bam:Reading bam file: sortedRef.bam
Traceback (most recent call last):0000 reads
File "/home/anting/.local/bin/iss", line 11, in
The bam files can be found here: https://drive.google.com/file/d/1UmeHdW5tAoeeXkyqBQylJ3k0RFc_zCus/view?usp=sharing Thanks so much!
It looks like the dimensions of the substitution matrix are hard-coded to 301x16 at the beginning of the to_model
function. Presumably this limits the read lengths to 2x150? Or is it 2x300? In any case, the BAM file referenced above contains numerous read fragments longer than 300bp—maximum fragment length is 491bp.
Hi,
First of all sorry for the delay, I haven't been able to take time to look at this in detail.
@standage thanks for taking a look! You are right that the matrices are initialised with a length of 301. They are later resized to the longest reads present in the dataset (100, or 150 per example).
This indeed currently limits InSilicoSeq to 300 base pair fragments, which to my knowledge is the longest illumina reads on the market (MiSeq).
@antingH could you tell us more about your input bam file? From which sequencing instrument is it from?
I presume that setting the default to something like 501 would have performance (runtime and memory) consequences. Would there be any other consequences (i.e. wonky results)?
In the referenced data set, for reads with no indels or indel errors (i.e., cigar =~ m/^\d+M$/
), 2.5% of the read fragments are > 300bp in length. Fewer than 0.04% are longer than 350bp in length.
Some combination of relaxing this constraint vs. trimming/discarding long reads might allow this analysis to move forward.
Initialising the arrays at 501 would add roughly 100Kb of memory usage 😛
There would not be other consequences for "normal-sized" datasets as far as I can anticipate. My concern is that Illumina sequencing works in cycles, and reads should not be longer than 301 base pairs.
I thing discarding the longer reads would be the better fix here, and maybe add a warning saying that it's weird Illumina data.
Does InSilicoSeq work only on Illumina data?
For the moment yes. We have plans to simulate long reads (mostly nanopore) but not in the short-term.
I'm pretty sure these are Ion Torrent reads.
In that case I'd advise against using insilicoseq
as it will not accurately model homopolymers errors
Encountered errors when running iss model:
DEBUG:iss.app:iss version 1.3.3 DEBUG:iss.app:Using verbose logger INFO:iss.app:Starting iss model INFO:iss.app:Using KDE ErrorModel INFO:iss.bam:Reading bam file: sortedRef.bam INFO:iss.bam:Calculating insert size distribution /home/anting/.local/lib/python3.6/site-packages/numpy/core/_methods.py:140: RuntimeWarning: Degrees of freedom <= 0 for slice keepdims=keepdims) /home/anting/.local/lib/python3.6/site-packages/numpy/core/_methods.py:110: RuntimeWarning: invalid value encountered in true_divide arrmean, rcount, out=arrmean, casting='unsafe', subok=False) /home/anting/.local/lib/python3.6/site-packages/numpy/core/_methods.py:132: RuntimeWarning: invalid value encountered in double_scalars ret = ret.dtype.type(ret / rcount) Traceback (most recent call last): File "/home/anting/.local/bin/iss", line 11, in
sys.exit(main())
File "/home/anting/.local/lib/python3.6/site-packages/iss/app.py", line 489, in main
args.func(args)
File "/home/anting/.local/lib/python3.6/site-packages/iss/app.py", line 263, in model_from_bam
bam.to_model(args.bam, args.output)
File "/home/anting/.local/lib/python3.6/site-packages/iss/bam.py", line 189, in to_model
hist_insert_size = modeller.insert_size(insert_size_dist)
File "/home/anting/.local/lib/python3.6/site-packages/iss/modeller.py", line 28, in insert_size
bw_method=0.2 / np.std(insert_size_distribution, ddof=1))
File "/home/anting/.local/lib/python3.6/site-packages/scipy/stats/kde.py", line 195, in init
raise ValueError("
dataset
input should have multiple elements.") ValueError:dataset
input should have multiple elements.