Empty extraction variable results in an IndexError

nikolasthuesen commented 4 years ago

https://github.com/DaehwanKimLab/hisat-genotype/blob/cf91052c5a6c96aa4804e57e946cca1856935822/hisatgenotype_modules/hisatgenotype_typing_process.py#L1648

When running hisatgenotype on the sample HG01341 I get the following error as seen below:

1: Extracting reads from HG01341.bam_reads1.fastq
                running hisat2 -x /home/projects/cu_10148/people/nikthu/hisatgenotype_test/indicies/genotype_genome --no-spliced-alignment -X 1000 -1 //home/projects/cu_10148/people/nikthu/data/1000G_ten_benchmark/HG01341.bam_reads1.fastq -2 //home/projects/cu_10148/people/nikthu/data/1000G_ten_benchmark/HG01341.bam_reads2.fastq

Traceback (most recent call last):
  File "/home/projects/cu_10148/people/nikthu/hisatgenotype_test/hisatgenotype", line 762, in <module>
    typing_process(args)
  File "/home/projects/cu_10148/people/nikthu/hisatgenotype_test/hisatgenotype", line 513, in typing_process
    args.verbose)
  File "/home/projects/cu_10148/people/nikthu/hisatgenotype_test/hisatgenotype_modules/hisatgenotype_typing_process.py", line 1780, in extract_reads
    verbose)
  File "/home/projects/cu_10148/people/nikthu/hisatgenotype_test/hisatgenotype_modules/hisatgenotype_typing_process.py", line 1295, in parallel_work
    verbose)
  File "/home/projects/cu_10148/people/nikthu/hisatgenotype_test/hisatgenotype_modules/hisatgenotype_typing_process.py", line 1649, in work
    read2[0],
IndexError: list index out of range

I tried printing read2 in each iteration and found, that in the iteration with the error, it was an empty list.

Successful iteration: read2 = 
['GCGTTCCAGGTCTCTCAGCCAATACAGCTGCCGACATTGTTCAAGGGCCTCTTTATAGAGACAGACTT', "<000777<77<<B<BB<B<<B<7<B<0<<<7B<<<'''''''''''''''''''''''''''''''''"]
Faulty iteration: read2 = [ ]

The same was by the way true for read1.

Hisatgenotype then continued with the reads, which were extracted before the error and returned the faulty result:


A B C DRB1 DQB1
"# VERSIONS:"
"# HISAT2 - 2.2.1"

"# HISAT-genotype - 1.3.2"

"# Database - Database /home/projects/cu_10148/people/nikthu/hisatgenotype_test/indicies/hla derived from HISATgenotype DB version: NONE"
"# COMMAND:"
/home/projects/cu_10148/people/nikthu/hisatgenotype_test/hisatgenotype --base hla --locus-list A,B,C,DRB1,DQB1 -p 10 -1 /home/projects/cu_10148/people/nikthu/data/1000G_ten_benchmark/HG01341.bam_reads1.fastq -2 /home/projects/cu_10148/people/nikthu/data/1000G_ten_benchmark/HG01341.bam_reads2.fastq --in-dir / --out-dir /home/projects/cu_10148/people/nikthu/testdir -v

hisat2 graph
hisat2 --mm --no-unal --no-spliced-alignment -X 1000 --max-altstried 64 --haplotype -x /home/projects/cu_10148/people/nikthu/hisatgenotype_test/indicies/hla.graph -p 1 -1 /home/projects/cu_10148/people/nikthu/testdir/HG01341.bam_reads1.fastq-hla-extracted-1.fq.gz -2 /home/projects/cu_10148/people/nikthu/testdir/HG01341.bam_reads1.fastq-hla-extracted-2.fq.gz
                        4 reads and 2 pairs are aligned
                                1 A*68:02:02 (count: 1)
                                2 A*68:01:24 (count: 1)
                                3 A*29:01:01:02N (count: 1)
                                4 A*66:01:02 (count: 1)
                                5 A*29:46 (count: 1)
                                6 A*26:03:01 (count: 1)
                                7 A*26:04 (count: 1)
                                8 A*29:40 (count: 1)
                                9 A*66:02 (count: 1)
                                10 A*34:02:01 (count: 1)

                                1 ranked A*26:01:03 (abundance: 100.00%)

                        2 reads and 1 pairs are aligned
                                1 C*15:09 (count: 1)
                                2 C*04:01:62 (count: 1)
                                3 C*08:127N (count: 1)
                                4 C*01:02:30 (count: 1)
                                5 C*01:04 (count: 1)
                                6 C*06:146 (count: 1)
                                7 C*04:239 (count: 1)
                                8 C*04:09N (count: 1)
                                9 C*01:02:37 (count: 1)
                                10 C*04:70 (count: 1)

                        6 reads and 3 pairs are aligned
                                1 DRB1*04:01:01:01 (count: 2)
                                2 DRB1*04:01:01:02 (count: 2)
                                3 DRB1*04:07:01 (count: 2)
                                4 DRB1*04:03:01 (count: 2)
                                5 DRB1*09:21 (count: 1)
                                6 DRB1*07:01:01:03 (count: 1)
                                7 DRB1*07:01:01:01 (count: 1)
                                8 DRB1*07:01:01:02 (count: 1)
                                9 DRB1*12:06 (count: 1)
                                10 DRB1*07:01:20 (count: 1)

                                1 ranked DRB1*03:124 (abundance: 50.00%)
                                2 ranked DRB1*12:06 (abundance: 50.00%)

                        6 reads and 3 pairs are aligned
                                1 DQB1*04:02:01 (count: 3)
                                2 DQB1*04:11 (count: 3)
                                3 DQB1*03:03:02:04 (count: 2)
                                4 DQB1*03:02:01:01 (count: 2)
                                5 DQB1*03:02:01:02 (count: 2)
                                6 DQB1*03:02:12 (count: 2)
                                7 DQB1*03:05:01 (count: 2)
                                8 DQB1*03:211 (count: 2)
                                9 DQB1*04:01:01 (count: 2)
                                10 DQB1*04:32 (count: 2)

                                1 ranked DQB1*04:02:01 (abundance: 33.33%)
                                2 ranked DQB1*04:02:05 (abundance: 33.33%)
                                3 ranked DQB1*04:11 (abundance: 33.33%)

chbe-helix commented 4 years ago

Hi Nikolas,

Try running the script without the --in-dir option. There may be something in the interplay with your -1, -2 options and that --in-dir option.

Thanks, Chris

nikolasthuesen commented 4 years ago

Hi Chris Thanks for the reply. I've run the script in the same fashion for 20 samples, and it works for the other 19. I think this is triggered by an oddity of the sample, but I can try running it without the --in-dir tomorrow. Nikolas

chbe-helix commented 4 years ago

Hi Nikolas,

I see. In that case no need to run it without the --in-dir option. I agree with you that it may be something odd in the sample. It may be a read naming issue, sequence issue, or reporting issue. The following command is what HISAT-genotype uses to align the reads for extraction. The resulting SAM file should be useful in helping determining the root of your error.

hisat2 -p 10 --no-spliced-alignment -X 1000 -x /home/projects/cu_10148/people/nikthu/hisatgenotype_test/indicies/genotype_genome -1 /home/projects/cu_10148/people/nikthu/data/1000G_ten_benchmark/HG01341.bam_reads1.fastq -2 /home/projects/cu_10148/people/nikthu/data/1000G_ten_benchmark/HG01341.bam_reads2.fastq

I think I have that formatted to your specific use case so you should, hopefully, be able to run it directly by copy-and-paste (note it will print to your console without a pipe '>' out). Let me know what you find. Hope this helps!

Thanks, Chris

nikolasthuesen commented 4 years ago

Hi Chris

Sorry for the late reply. I have now investigated the case further, and found that the caue of the error was a couple of unpaired reads in my fastq files. The error therefore was not in hisat2 or HISAT-genotype, but in my data conversion. Sorry for the inconvenience. The only thing, which then could be improved on your tools in relation to this case is maybe a warning to the user, when the input fastq files have faults, instead of the current behaviour, where the tool gives a prediction on a very limited number of reads.

Thanks Nikolas

chbe-helix commented 4 years ago

Hi Nikolas,

Thanks for the update! That's exactly my thought. I'll make a note to add an error check in the next version. Thanks!

Thanks, Chris

DaehwanKimLab / hisat-genotype

Empty extraction variable results in an IndexError #29