bcgsc / NanoSim

Nanopore sequence read simulator
Other
217 stars 51 forks source link

Read analysis -- sam_read Parse error. #171

Closed kosmasgal closed 1 year ago

kosmasgal commented 1 year ago

Hello! I've been having this error for any run of 'read_analysis.py' I do with a specific set of FASTQ reads I have. From the error I thought it might be because of a bad reference transcriptome file but I managed to produce a model with a different set of FASTQ reads. Regardless of that I am curious if you could help me understand what was at fault in this case. My input is a set of human lrRNA ONT reads, that I subsampled using seqtk to 100,000 reads, in order to test the software. Here is the output I get when I encounter the issue:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
ERROR conda.cli.main_run:execute(32): Subprocess for 'conda run ['python', '/home/src/read_analysis.py', 'transcriptome', '-i', 'subsampled.fastq', '-rg', 'Homo_sapiens.GRCh37.75.dna.all.fa', '-rt', 'Homo_sapiens.GRCh37.cdna.all.fa', '-t', '16', '--no_intron_retention', '-o', 'subsampled']' command failed.  (See above for error)
[M::mm_idx_gen::4.957*1.56] collected minimizers
[M::mm_idx_gen::6.645*1.80] sorted minimizers
[M::main::6.675*1.80] loaded/built the index for 180253 target sequence(s)
[M::mm_mapopt_update::7.069*1.75] mid_occ = 108
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 180253
[M::mm_idx_stat::7.297*1.73] distinct minimizers: 15117783 (40.17% are singletons); average occurrences: 3.514; average spacing: 5.406; total length: 287163541
[E::sam_parse1] query name too long
[W::sam_read1] Parse error at line 180255
[main_samview] truncated file.
[E::idx_find_and_load] Could not retrieve index file for 'subsampled_transcriptome_alnm.bam'
Traceback (most recent call last):
  File "/home/src/read_analysis.py", line 751, in <module>
    main()
  File "/home/src/read_analysis.py", line 698, in main
    quantification, normalize)
  File "/home/src/read_analysis.py", line 111, in align_transcriptome
    unaligned_length, strandness = get_primary_sam.primary_and_unaligned(t_alnm, prefix + "_transcriptome")
  File "/home/src/get_primary_sam.py", line 183, in primary_and_unaligned
    strandness = float(pos_strand) / num_aligned
ZeroDivisionError: float division by zero

running the code with following parameters:

infile subsampled.fastq
ref_g Homo_sapiens.GRCh37.75.dna.all.fa
ref_t Homo_sapiens.GRCh37.cdna.all.fa
annot 
aligner minimap2
g_alnm 
t_alnm 
prefix subsampled
num_threads 16
model_fit True
intron_retention False
chimeric False
quantification False
normalize by transcript length False
2022-10-20 09:08:13: Read pre-process and unaligned reads analysis
2022-10-20 09:08:23: Alignment with minimap2 to reference transcriptome
2022-10-20 09:08:59: Processing transcriptome alignment file: bam
kosmasgal commented 1 year ago

Alternative error I've had with this file:

[W::sam_read1_sam] Parse error at line 117064
samtools view: error reading file "-"
[M::mm_idx_gen::55.316*2.13] collected minimizers
[M::mm_idx_gen::79.227*2.30] sorted minimizers
[M::main::79.227*2.30] loaded/built the index for 61 target sequence(s)
[M::mm_mapopt_update::82.605*2.24] mid_occ = 650
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 61
[M::mm_idx_stat::84.320*2.22] distinct minimizers: 163936851 (36.46% are singletons); average occurrences: 5.538; average spacing: 3.005; total length: 2728222451
[E::sam_parse1] query name too long
[W::sam_read1_sam] Parse error at line 63
samtools view: error reading file "-"
kosmasgal commented 1 year ago

The FASTQ reads I was using had QNAMEs that were too long, resulting in samtools having an issue converting the sam alignment of the reads to the reference transcriptome into a bam. I truncated all the query names and now the code runs fine.