lucianhu opened 1 month ago
Hi Nhu, I also use MEGAHIT and am getting the same error "... line 1260, in make_unclassified_seq_fasta fasta_dict[seq])) ... KeyError: ...." Did you find a solution? Thanks, Shea
Hi Shea,
No, I haven't found a solution to the error yet.
Best,
Nhu
I'm trying the latest version 6.0.1 and will let you know if that works for me.
Thank you very much. I posted my question on Biostars, but no one replied. Have a good day, Nhu
v6.0.1 still has this error.
Thank you for your message. In my case, I used Kraken2 to classify reads, but it's actually not ideal because the classification percentage is very low.
Hi there, sorry about the wait. From the error itself, I don't think there is an issue with the contigs or the assembly. Something seems to be going wrong between the read names in the read files and the read names in the sorted BAM file. Could you post the first 10 lines of one of the read files and the first 10 lines of the corresponding sorted BAM file (output of samtools view)?
Hello Thauptfeld,
Thank you for your response.
Here is my BAM file,
FT100012261L1C012R00301420211 99 k127_43926 1 60 150M = 62 211 GCCGATCCGCTCGAGCCCGCGCCCAAAGGCGCCGCCGGGCAACTCGCCGAAGCGCACCAGCACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAA IIIIEDIIIIDIIEIIIIIIIIIIEEEIIIIIIIIIIIIIEEIDIIIIIEEIIIIEIIEIIEIIEIIIEIIIDIIICDIIIIIIIEIIIIICIIIIIDDIDIIEEIIEIIIICDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEGIECIEE NM:i:0 MD:Z:150 MC:Z:150M MQ:i:60 AS:i:150 XS:i:0
FT100012261L1C012R00301420211 147 k127_43926 62 60 150M = 1 -211 ACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGG CIHCIIICIIICIIIDEEIIIGI?CIIIIIDIIIIIEEIDIICBIICIIIIEEGIIIIIIBCIIIICEIDCICIICIICIICIICDICCIDIIIIIDIICIIIIICIIDIICEIDIIDIEIIIDIIDIICIIEIIIIIDIIIIIEEIDII NM:i:0 MD:Z:150 MC:Z:150M MQ:i:60 AS:i:150 XS:i:0
FT100012261L1C006R00300498314 99 k127_43926 72 40 127S19M4S = 72 19 GCCCCCCCTGTCTTCGGCTTCCTCGGCGCCTTCCTTCTTGATGTGACCCTCGACGTCGAGCTCCTCGTTGTTCAGCAGGTTGCTCGATAAGTTGAGCAGCTCATCCGACATGTCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCA IIIIIIIIDICIDDIIBICD@IDIIIIDIIDDIIDDIDD@DCICIBIIICIF7IFCIIEIICIICIHCCCCCICHIBIICDI=CI=6C8DGCCCEAIEIIDIDCII6DIDCICICIICIICCII?IDHAIIICC:IIIII)?@IIGAII? NM:i:0 MD:Z:19 MC:Z:15S19M116S MQ:i:40 AS:i:19 XS:i:0
FT100012261L1C006R00300498314 147 k127_43926 72 40 15S19M116S = 72 -19 TCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCAGTTACTCCTCCAGACGTGGTGGAGGAGCGAAAGCTTCCCGGCATTGTCTCGCTCTCCGAGAAGCGATTCAATCCCTCGATGAGCCGGTTCCGTGTGCTGGCGTCGTCCCTGC DI?HIDGIBEIIIICIBIIIEEIIIIIIICIIHIEIIDIEECIDIIDIIDICIIEIIEIIDIIDIIIDDCIIDEIIIIIICEEIEIEIIIDIEIIIDIDCIIIDEEIDDEIIIEIIDDIDIIIIIEEIIIEIEIIEIIIIEIIEIIIDII NM:i:0 MD:Z:19 MC:Z:127S19M4S MQ:i:40 AS:i:19 XS:i:0
FT100012261L1C003R00101959697 163 k127_43926 111 60 150M = 165 204 GGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGGAGCAGCGAAACGCGGGCACGCGGGCGCCCGAGCCGTTGCGCGGCTCGTG IIDDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEIIECIEEIEIIIIIEIIEIIIIIEIIEIIECIEIIEICIIIEIIDIIDIICIIIIIEIIIIICCIDIIEIIEIIIEEEIIIIIIIEIIIIIIIIIIIIDIIIICCIIIIIIICIICI NM:i:0 MD:Z:150 MC:Z:150M MQ:i:60 AS:i:150 XS:i:0
and this is from the read files.
@FT100012261L1C001R00100000393/1
TTTGCCGGACAACCAGCCGCCGCCCAAGGGGGACCAAGGCAGCAGCCCGATCCCCGCATCGAGCGCCGCCGGGACGATCTCGAATTCGATGTCCCGGACAAGAAGGCTGTACTGCGGCTGCAGCCTCACCGGCGGGTTCCAGCCTTGGCT
+
CCCIIIF<6I@EII;;II=IIIIII>>IB?II.IIE:I:I@DI><IIIIC=IIIIII:CII5=I6IIIIIIIIDIH,BICA>E0:?II*BICIII=.@IE-GD;AI05(=5ICII0IIC/I<I)IBI-I0@FI.FIB=II,-II*C6II;
@FT100012261L1C001R00100000429/1
TCCAGCGGCACGCCTAGCTCCTGCGCGACCGTTTCCAACGACCGGGGCGTGCTCAGAATGTTTTTACGCTCGAAATTGCCCCGCTCGGTCACGTCGAAGTAGCGCAGCACCAGTGGTGCGTCGTCGGGTCCTACCACGGCAGTAATCTCG
+
CIIEDII<IDIIIICEIICIICIIIIIDIIICC<IIDEIFDIIIIIII<C;IBIEIE@CH5BCCCEIIICIHEDDCCIIIIIIHCI9IAIEIHBIHDBIC8IIHIDHIEIIEF<IACHII+IC7IA-IBIIC<IIEII/ID-CD@CICIA
@FT100012261L1C001R00100000695/1
CGAGGAAGAGGCCCGCGACCAGATCGAGACCAACCTGTTCGGCGCCCTGTGGATCACCCAGGCTGCGCTGCCCATCCTGCGTGAGCAGGGGAGCGGTCACATCATCCAGGTGTCGTCGATCGGGGGCATCTCCGCGTTCCCGCGCGTCGG
I look forward to hearing from you.
Best,
Nhu
Hi All, yesterday I'm pretty sure I was able to get RAT running with a similar BAM file by changing `fasta_dict[seq]` to `fasta_dict[seq_id]` in line 1259 of `reads.py`: `outf.write('>{0}{1}\n{2}\n'.format(seq_id, suffix, fasta_dict[seq]))`
Cheers, Shea
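For readers following along: a minimal sketch of the lookup mismatch that workaround targets. The dictionary and IDs below are made up for illustration (only the variable names `seq` and `seq_id` come from `reads.py`); the assumption is that `fasta_dict` is keyed by the FASTQ-style ID (with its `/1` mate suffix), while the ID coming back from the BAM file has had that suffix stripped.

```python
# Hypothetical reconstruction of the KeyError: fasta_dict is keyed by the
# read ID as written in the FASTQ file, but the BAM-derived ID lacks /1.
fasta_dict = {"FT100012261L1C009R00400078634/1": "GCCGATCCGCTC"}

seq = "FT100012261L1C009R00400078634"   # ID as it appears in the BAM file
seq_id = seq + "/1"                     # ID restored to its FASTQ form

try:
    fasta_dict[seq]                     # raises KeyError: suffix missing
except KeyError as e:
    print("KeyError:", e)

print(fasta_dict[seq_id])               # succeeds with the FASTQ-style key
```

Whether `seq_id` actually carries the suffix at that point in RAT's code is exactly what the maintainers would need to confirm; this only illustrates why one lookup fails and the other can succeed.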
Hello Thauptfeld,
Thank you for considering my feedback. Yesterday, I tried your updated code, but there are still some issues.
[M::mem_process_seqs] Processed 628586 reads in 129.108 CPU sec, 5.694 real sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 /path/to/Phase_1_1.contigs.fa ./Phase_1_1_1.fastq ./Phase_1_1_2.fastq
[main] Real time: 114.846 sec; CPU: 2686.158 sec
[2024-11-02 10:05:31] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-11-02 10:05:39] Read mapping done!
[2024-11-02 10:05:39] contig2classification file supplied. Processing contig classifications.
[2024-11-02 10:05:39] Loading file /path/to/20240422_CAT_nr/tax/nodes.dmp.
[2024-11-02 10:05:41] Processing mapping file(s).
[2024-11-02 10:06:10] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-11-02 10:06:10] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-11-02 10:06:11] Contigs written! Appending forward reads...
Traceback (most recent call last):
File "/path/to/CAT_pack", line 101, in <module>
main()
~~~~^^
File "/path/to/CAT_pack", line 85, in main
reads.run()
~~~~~~~~~^^
File "/path/to/CAT_pack/reads.py", line 435, in run
make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
uncl_unm_fasta, 'fastq', 'a','_1')
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/path/to/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta
fasta_dict[seq]))
~~~~~~~~~~^^^^^
KeyError: 'FT100012261L1C009R00400078634'
Sincerely,
Nhu
Hi Nhu,
the issue with the run seems to be that in the read files, the read IDs have a /1 at the end (or a /2 in the reverse file, I assume), which is removed during mapping. So in the BAM file they don't have the /1 at the end, and RAT ends up looking for a read ID that doesn't exist in the read file. RAT is supposed to handle this for '/1' and '_1', so I'll have to check what is going on.
Will let you know.
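The mismatch described above is usually handled by normalizing read IDs on both sides before any dictionary lookup. A minimal sketch of that idea (the function name and suffix list are assumptions for illustration, not RAT's actual code):

```python
def normalize_read_id(read_id):
    """Strip common mate suffixes so IDs from the FASTQ files and
    from the BAM file compare equal."""
    for suffix in ("/1", "/2", "_1", "_2"):
        if read_id.endswith(suffix):
            return read_id[: -len(suffix)]
    return read_id

# IDs from the read file carry /1; the same read in the BAM file does not.
fastq_id = "FT100012261L1C001R00100000393/1"
bam_id = "FT100012261L1C001R00100000393"

assert normalize_read_id(fastq_id) == normalize_read_id(bam_id)
```

One caveat of this approach: stripping `_1` blindly would also clip legitimate IDs that happen to end in `_1`, so a real implementation would need to apply the same normalization consistently on both the FASTQ and BAM side rather than to only one of them.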
Hello Team,
Thank you for the wonderful tool. However, I have a few questions:
1. When I run RAT, it gives an error related to the reads with the following message. From what I understand, my FASTQ headers must be present in the contigs. My contigs were generated by MEGAHIT, and they only have names like >k127_10000, without any FASTQ names. What should I do in this case?
2. When I run RAT, it requires me to produce CAT.contig2classification.txt first, but your tutorial does not mention that.
3. Does this RAT method (mapping reads to contigs) require re-running the deduplication step to accurately calculate relative abundance?
Thank you and please assist,
Nhu