SethCommichaux / taxaTarget

Error detecting input file format #1

Closed rishibhandari63 closed 2 years ago

rishibhandari63 commented 2 years ago

I get an error when running it on my shotgun metagenome reads. All my input reads and the database are in my home folder, and I have set an output folder in my scratch space.

CPU threads: 8
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: /scratch/class01/taxaTarget/10EE_S60.results
Target sequences to report alignments for: 25
Opening the database... [5.366s]
Database: /home/class01/taxaTarget/data//marker_geneDB.fasta.dmnd (type: Diamond database, sequences: 877724, letters: 500196405)
Block size = 2000000000
Opening the input file... [0.297s]
Error: Error detecting input file format. First line seems to be blank.
Beginning analysis.
Mapping reads with Kaiju.
Extracting reads mapped by Kaiju.
Aligning reads with Diamond.
Traceback (most recent call last):
  File "run_protist_pipeline_fda.py", line 101, in <module>
    if os.path.getsize(out+'/kaiju.fasta.diamond') == 0: sys.exit("No reads mapped to the marker genes with Diamond. Analysis ended!")
  File "/opt/asn/apps/anaconda_3-2020.11/lib/python3.8/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/class01/taxaTarget/10EE_S60.results/kaiju.fasta.diamond'
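
The traceback above is raised because os.path.getsize is called on a file that Diamond never wrote. As a rough sketch only (this is not the pipeline's actual code; the helper names diamond_output_status and check_diamond_output are hypothetical), a guard along these lines would report the missing file separately from the "no reads mapped" case:

```python
import os
import sys


def diamond_output_status(path):
    """Classify a Diamond output file as 'missing', 'empty', or 'ok'."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    return "ok"


def check_diamond_output(out_dir):
    # 'kaiju.fasta.diamond' is the file name that appears in the traceback.
    path = os.path.join(out_dir, "kaiju.fasta.diamond")
    status = diamond_output_status(path)
    if status == "missing":
        # Diamond aborted before writing anything, e.g. because its input was empty.
        sys.exit(f"Diamond did not produce {path}; check the Diamond messages above.")
    if status == "empty":
        sys.exit("No reads mapped to the marker genes with Diamond. Analysis ended!")
    return path
```

With a check like this, an empty kaiju.fasta would end the run with a readable message instead of a FileNotFoundError.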

Here is the header of my input file

@A00201R:540:HTWLMDSX2:4:1101:8540:1501:N:0:CTAGATTGCG+CGCCATATCT#0/1
GAGTTCTGCCATGACCACCGCCAGCGGCCTGCCGCCGTCGATCACCGTTTCCTCCCTCGACCTCGCGCGCCTGGAGGCGTTGCTGGATACCCCCGCC
+
FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00201R:540:HTWLMDSX2:4:1101:9986:1501:N:0:CTAGATTGCG+CGCCATATCT#0/1
CGGGCGCCACTTCGCGCGCGTCCAGCATCACGTCCTTGCCGGTCCTGGCGTCGTGCAGCAGGAAGAAGCCGCCACCGCCCAGG
+
FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FF
@A00201R:540:HTWLMDSX2:4:1101:10095:1501:N:0:CTAGATTGCG+CGCCATATCT#0/1
GCATCATGCTGCCCAACCATTCGCCGCTCGTCATTGCCGAACAGTTCGGCACGCTGGCAGCCCTTTTGCCAGGCCGTGTTGACCTTGGCCTTGG

In my results folder I have three output files: kaiju, an empty kaiju.fasta, and read_file_info.txt.

Thank you.

SethCommichaux commented 2 years ago

Hello,

Could you share the first few lines of the kaiju file? I'm thinking there might be issues with how the fasta headers in that file are represented.

Thanks, Seth

rishibhandari63 commented 2 years ago

My kaiju file looks like

C A00201R:540:HTWLMDSX2:4:1101:25391:1501:N:0:CTAGATTGCG+CGCCATATCT#0 98 UniRef100_V5FSW9,
C A00201R:540:HTWLMDSX2:4:1101:5954:1595:N:0:CTAGATTGCG+CGCCATATCT#0 155 UniRef100_T1MW98,
C A00201R:540:HTWLMDSX2:4:1101:8088:1376:N:0:CTAGATTGCG+CGCCATATCT#0 66 UniRef100_L8H989,
C A00201R:540:HTWLMDSX2:4:1101:23086:1297:N:0:CTAGATTGCG+CGCCATATCT#0 130 UniRef100_A0A444BV45,
C A00201R:540:HTWLMDSX2:4:1101:11315:1329:N:0:CTAGATTGCG+CGCCATATCT#0 66 UniRef100_A0A2N1L258,UniRef100_A0A6V8QKA9,UniRef100_A0A395NZ34,
C A00201R:540:HTWLMDSX2:4:1101:13711:2440:N:0:CTAGATTGCG+CGCCATATCT#0 95 UniRef100_A0A182WSH2,UniRef100_A0A182QM46,UniRef100_A0A182VZB5,UniRef100_A0A182NS49,UniRef100_A0A182FBP3,
C A00201R:540:HTWLMDSX2:4:1101:29116:3443:N:0:CTAGATTGCG+CGCCATATCT#0 72 UniRef100_A0A409XNK3,UniRef100_A0A409VTK7,
C A00201R:540:HTWLMDSX2:4:1101:8006:3458:N:0:CTAGATTGCG+CGCCATATCT#0 107 UniRef100_F0XQ24,
C A00201R:540:HTWLMDSX2:4:1101:4562:3724:N:0:CTAGATTGCG+CGCCATATCT#0 91 UniRef100_A0A6U4CSK6,UniRef100_A0A6U4PJC4,
C A00201R:540:HTWLMDSX2:4:1101:10818:3881:N:0:CTAGATTGCG+CGCCATATCT#0 92 UniRef100_A0A6D2I8D3,
C A00201R:540:HTWLMDSX2:4:1101:19397:3865:N:0:CTAGATTGCG+CGCCATATCT#0 119 UniRef100_A0A1A9X6J1,UniRef100_A0A1B0C157,
C A00201R:540:HTWLMDSX2:4:1101:11017:3787:N:0:CTAGATTGCG+CGCCATATCG#0 77 UniRef100_A0A7J6YRQ6,
C A00201R:540:HTWLMDSX2:4:1101:7184:4351:N:0:CTAGATTGCG+CGCCATATCT#0 99 UniRef100_A0A1W0A4D9,
C A00201R:540:HTWLMDSX2:4:1101:10954:4460:N:0:CTAGATTGCG+CGCCATATCT#0 75 UniRef100_A0A425CKJ8,UniRef100_A0A3M6VAV2,
C A00201R:540:HTWLMDSX2:4:1101:17219:3035:N:0:CTAGATTGCG+CGCCATATCT#0 75 UniRef100_A0A0R3RTC3,
C A00201R:540:HTWLMDSX2:4:1101:9706:6214:N:0:CTAGATTGCG+CGCCATATCT#0 67 UniRef100_A0A2N1JHI0,
C A00201R:540:HTWLMDSX2:4:1101:28275:6245:N:0:CTAGATTGCG+CGCCATATCT#0 73 UniRef100_A0A420I9F8,
C A00201R:540:HTWLMDSX2:4:1101:5674:6308:N:0:CTAGATTGCG+CGCCATATCT#0 80 UniRef100_A0A553NNA0,
C A00201R:540:HTWLMDSX2:4:1101:16396:7153:N:0:CTAGATTGCG+CGCCATATCT#0 77 UniRef100_A0A7E4V1Z9,
C A00201R:540:HTWLMDSX2:4:1101:15790:5822:N:0:CTAGATTGCG+CGCCATATCT#0 70 UniRef100_A0A4Y9XY08,
C A00201R:540:HTWLMDSX2:4:1101:6379:4711:N:0:CTAGATTGCG+CGCCATATCT#0 67 UniRef100_A0A7H9B3Z8,
C A00201R:540:HTWLMDSX2:4:1101:18421:7247:N:0:CTAGATTGCG+CGCCATATCT#0 70 UniRef100_D2UZR0,
C A00201R:540:HTWLMDSX2:4:1101:25491:7185:N:0:CTAGATTGCG+CGCCATATCT#0 76 UniRef100_A0A7J7MRD4,
C A00201R:540:HTWLMDSX2:4:1101:3766:7952:N:0:CTAGATTGCG+CGCCATATCT#0 130 UniRef100_A0A2P6MPD3,
C A00201R:540:HTWLMDSX2:4:1101:6677:7513:N:0:CTAGATTGCG+CGCCATATCT#0 66 UniRef100_UPI00046B86F1,UniRef100_UPI00101A6817,UniRef100_A0A7J7RN07,UniRef100_UPI00174EE613,UniRef100_UPI001879286F,UniRef100_UPI00187BE54D,UniRef100_A0A6J2LGH6,UniRef100_A0A091DAR8,UniRef100_G3QQ06,UniRef100_H2QIP3,UniRef100_A0A2R8ZMY9,UniRef100_A0A2I2Z2S3,UniRef100_A0A2I3TTX4,UniRef100_A0A2R8ZL60,UniRef100_I3L6Y7,UniRef100_UPI00174F6B19,UniRef100_A0A4X1VM36,UniRef100_A0A1S3G8E0,UniRef100_A0A250YM71,UniRef100_UPI00098197E9,UniRef100_A0A480ZGB8,
C A00201R:540:HTWLMDSX2:4:1101:2799:7623:N:0:CTAGATTGCG+CGCCATATCT#0 75 UniRef100_A0A1L9S8C0,UniRef100_A0A5N6GE77,UniRef100_A0A5N6ILS2,UniRef100_A0A2J5HY18,UniRef100_A0A2I2FFA2,UniRef100_A0A2I1DCZ6,UniRef100_A0A0L1J1B3,UniRef100_A0A5N7DCG3,UniRef100_A0A5N6W4C4,UniRef100_A0A5N6YEP6,UniRef100_A0A5N6EHA5,UniRef100_A0A5N6WKJ1,UniRef100_A0A5N6DQK4,UniRef100_A0A5N6T1F6,UniRef100_A0A5N6ZMC6,UniRef100_A0A5N7E2C6,UniRef100_A0A5N7AT84,UniRef100_A0A5N6V689,
C A00201R:540:HTWLMDSX2:4:1101:19397:7842:N:0:CTAGATTGCG+CGCCATATCT#0 73 UniRef100_A0A3M7L515,UniRef100_A0A087SHP5,UniRef100_A0A1D1ZN97,
C A00201R:540:HTWLMDSX2:4:1101:24912:8656:N:0:CTAGATTGCG+CGCCATATCT#0 176 UniRef100_UPI0005CE24EE,

SethCommichaux commented 2 years ago

Thanks! I see that the FASTQ headers end with #0/1, but in the kaiju file the read names end with #0. Kaiju must be stripping the trailing /1 when it parses the headers, so the read names no longer match the original headers.
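
To illustrate the mismatch, here is a minimal sketch of how a FASTQ header could be reduced to the form seen in the kaiju file. The rule used (drop the leading @, keep the first whitespace-separated token, remove a trailing /1 or /2) is inferred only from the examples posted in this thread, not from Kaiju's documentation, and kaiju_style_id is a hypothetical helper name rather than part of taxaTarget or Kaiju.

```python
import re

# Trailing mate marker such as "/1" or "/2" (assumption based on the thread's examples).
_MATE_SUFFIX = re.compile(r"/[12]$")


def kaiju_style_id(header):
    """Reduce a FASTQ/FASTA header line to the read name as it appears in the kaiju file."""
    name = header.strip()
    if name[:1] in ("@", ">"):
        name = name[1:]
    tokens = name.split()
    if not tokens:
        return ""
    return _MATE_SUFFIX.sub("", tokens[0])


print(kaiju_style_id(
    "@A00201R:540:HTWLMDSX2:4:1101:8540:1501:N:0:CTAGATTGCG+CGCCATATCT#0/1"
))
```

Looking reads up by this reduced name rather than by the full header is one way a lookup between the two files could be made to agree.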

rishibhandari63 commented 2 years ago

Do you have any suggestions to solve this issue?

SethCommichaux commented 2 years ago

Yes, I've introduced a stopgap measure to deal with it. Clone the repository again and let me know if you run into any errors. In the meantime I've reached out to the Kaiju folks and am waiting for their reply. Because they are cutting off the tail end of the FASTQ header, they lose the information about whether the forward or the reverse read mapped.

palomo11 commented 1 year ago

Hi, I have a similar issue (No reads mapped to the marker genes with Kaiju. Analysis ended!) and it seems the stopgap measure is not solving it. Here you can find my fastq file and my kaiju output:

head Sample1_1.fastq

@V350086151L2C001R0010000001/1
CACACACCAAAAGAGCTGTGAGTCCGTGAGTCCCCAATCGCGAAGCACAATCGTTTTGCCGTTCGCGACATCAACAATCGCCCTCGGTTGCTTCCGATACACGTACCTGATACTTAAAGGAGGTCATTGTAATAGTTAAGTGCGGTAAAA
+
eedeeebbfffedEcdTcPbF_]W^e`de_Ge[d^dddecScd[b`ec`SddcfLeJeeefddfFfeeffeeedce`eeXe_cee`aWf\ZRd[cYEbeeee^dMRefbYdD^LQfeeJbcWde\cedeebDe`UdF[eeceTd]deeec
@V350086151L2C001R0010000002/1
GGTCAGCTTGAGTTCGACCTTGCCGCCCAGCGCCTCGAGGTTCTTCCCCTGCTCCTCGCGCGCGGCGCGCAGCGCGGCCGCCTGCTCCTCGCGGGCGATGCGCAGCGCCTCATCAATCCGCTGGTGCAGCGAGGCCATCTGCTCGGCGAG
+
ecfdefefeeefefee[edeeeefeeeffffeedbeebfecfeedeefeefeecefdddfeeeeedefeeeeceffeeeedfeedebffe__aedefddfedfcefeeefeeeedfeefeeffedfeeeeeIffefdfeeefeff`feVe
@V350086151L2C001R0010000077/1
GATCGCCCACAACGTCGTCATTGGCGACCATTGCCTCGTCGTCGCCCAGGTGGGCATCGCAGGCAGCACCCGGCTGGGGAATTACGTGGCCCTGGGTGGACAGGTCGGCCTGGCCGGCCACCTGAAGATCGGCAACCAGGTCACCGTCGC

head Sample1_2.fastq

@V350086151L2C001R0010000001/2
TCCACCTCACTCCCACCGAAGCACCGAAGAACGTGAACCCGCCTGCATCCATCAAAACTTGTGTCGTTGCCGTCATTTTTTAAAGCTTGTTGCCCTTGCTTTTACAGCACTTAACTAATACAATGACCTCCTTTAAGGATCAGGTACCAG
+
eeeeVIeeeEEPZWHbee_eOce[LNKecWeGcZeCc\DeZVedZcdee_VededddLPfYdedde`edefdedeWeVd`dMeebeeeSdHYIecaee^eedeeaLGeccRbddeeebFede]decCefZe_edPcHFbdW\bVIeLFQd
@V350086151L2C001R0010000002/2
CCCGCGTCGAGGCCGGGATCGCCGAGGGGCGGGAGGCGACGCTGCGCAAGCAGAGCGAGGCGCTCGCCGAGCTGATGGCCTCGCTGTACCAGCGGTTTGATGAGGCGCTGCGCATCGCCCGCGTGGCGCAGGCGGACGCGCAGCGCGCCG
+
eedeec^dcMfZddXSee_efeecUebadYbdee`b^e_eeb_efe\Lbf[ScFdef_eZbVOe_\d[cRfdJeeef_]QeddEdeQeeeefOe\Lec[aed]ee[]R`cC`cS^LaedPSRcEcUHec\e`ec`TeV_e]KceeMEfdH
@V350086151L2C001R0010000077/2
CTGCTGGAGGGCCAGGATCTGACGTTTCATTTTCCGGTCGGGCCGCGCGCGTGAGCCCAGCCACTTTTCGCCATCGGGGATGTCACGCATGTCCCCCGTCTGGGCGGCGGCGGTGACCTGGATGCCGATCTTCAGGTGGCCGGCGAGGCC

head kaiju

C   V350086151L2C001R0010080933 87  UniRef100_A0A4S4EAY3,UniRef100_UPI000CE19BB6,   
C   V350086151L2C001R0010342833 118 UniRef100_A0A421FSM4,UniRef100_A0A3R7JQ58,UniRef100_A0A3F2RCQ2, 
C   V350086151L2C001R0010427682 144 UniRef100_A0A6T9N8P3,   
C   V350086151L2C001R0010483629 249 UniRef100_A0A6T7JN01,   
C   V350086151L2C001R0010521392 116 UniRef100_B7FSJ6,UniRef100_A0A1E7FB78,  
C   V350086151L2C001R0010686242 126 UniRef100_A0A1V2LT19,UniRef100_A0A099P2Z5,UniRef100_A0A507ELQ1, 
C   V350086151L2C001R0010929114 155 UniRef100_A0A075B065,
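
In the listing above, the kaiju read names carry no /1 or /2 suffix while the FASTQ headers do. A quick diagnostic along the following lines can show whether that alone explains the empty result. It is only a sketch: the function names are hypothetical, and it assumes standard 4-line FASTQ records and a kaiju file whose first column is C or U and whose second column is the read name.

```python
import re
from itertools import islice


def fastq_ids(lines):
    """Yield the first header token (without '@') of each 4-line FASTQ record."""
    for header in islice(lines, 0, None, 4):
        tokens = header.strip()[1:].split()
        if tokens:
            yield tokens[0]


def kaiju_read_names(lines):
    """Yield the read name (second column) of rows marked as classified ('C')."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "C":
            yield fields[1]


def match_report(fastq_lines, kaiju_lines):
    """Count classified reads whose names match FASTQ IDs exactly or without /1, /2."""
    ids = set(fastq_ids(fastq_lines))
    stripped = {re.sub(r"/[12]$", "", read_id) for read_id in ids}
    names = list(kaiju_read_names(kaiju_lines))
    return {
        "classified": len(names),
        "exact": sum(name in ids for name in names),
        "after_stripping_mate_suffix": sum(name in stripped for name in names),
    }


if __name__ == "__main__":
    # File names taken from the commands shown in this comment.
    with open("Sample1_1.fastq") as fq, open("kaiju") as kj:
        print(match_report(fq, kj))
```

If "exact" is 0 while "after_stripping_mate_suffix" is close to "classified", the read names are being shortened somewhere between the FASTQ file and the kaiju output, which would leave kaiju.fasta empty.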