MGXlab / CAT_pack

CAT/BAT/RAT: tools for taxonomic classification of contigs and metagenome-assembled genomes (MAGs) and for taxonomic profiling of metagenomes
MIT License
194 stars 31 forks source link

Questions Regarding RAT Tool and Contig Processing #133

Open lucianhu opened 1 month ago

lucianhu commented 1 month ago

Hello Team,

Thank you for the wonderful tool. However, I have two questions:

When I run RAT, it gives an error related to the reads with the following message. From what I understand, my FASTQ headers must be present in the contigs. My contigs were generated by MEGAHIT, and they only have names like >k127_10000, without any FASTQ names. What should I do in this case?

When I run RAT, it requires me to perform CAT.contig2classification.txt first, but your tutorial does not mention that.

Does this RAT method (mapping reads to contigs) require re-running the deduplication step to accurately calculate relative abundance?

Thank you and please assist,

Nhu

./CAT_pack reads --mode cr -c contigs.fa -1 R_1.fastq -2 R_2.fastq -d /path/to/20240422_CAT_nr/db/ -t /path/to/200422_CAT_nr/tax/ -o lala -p out.CAT.predicted_proteins.faa -a out.CAT.alignment.diamond --c2c out.CAT.contig2classification.txt

<frozen genericpath>:39: RuntimeWarning: bool is used as a file descriptor
# CAT_pack v6.0.

[2024-10-11 09:41:47] samtools found: samtools 1.21.
[2024-10-11 09:41:47] bwa found: Version: 0.7.18-r1243-dirty.

RAT is running. Mapping reads against assembly with bwa mem.

[2024-10-11 09:41:47] Running bwa mem for read mapping. File fastq.bwamem.sorted will be generated.Do not forget to cite bwa mem and samtools when using RAT in your publication!
[2024-10-11 09:41:47] Contigs fasta is already indexed.
[2024-10-11 09:41:47] Running bwa mem...
...
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 
[main] Real time: 118.429 sec; CPU: 2701.163 sec
[2024-10-11 09:43:52] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-10-11 09:44:00] Read mapping done!

[2024-10-11 09:44:00] contig2classification file supplied. Processing contig classifications.
[2024-10-11 09:44:00] Loading file 20240422_CAT_nr/tax/nodes.dmp.
[2024-10-11 09:44:03] Processing mapping file(s).

[2024-10-11 09:44:32] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-10-11 09:44:32] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-10-11 09:44:33] Contigs written! Appending forward reads...
Traceback (most recent call last):
  File "CAT_pack/CAT_pack/./CAT_pack", line 101, in <module>
    main()
    ~~~~^^
  File "CAT_pack/CAT_pack/./CAT_pack", line 85, in main
    reads.run()
    ~~~~~~~~~^^
  File "CAT_pack/CAT_pack/reads.py", line 435, in run
    make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                uncl_unm_fasta, 'fastq', 'a','_1')
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "CAT_pack/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta
    fasta_dict[seq]))
    ~~~~~~~~~~^^^^^
KeyError: 'FT100012261L1C013R00202177739'
sheaster commented 1 month ago

Hi Nhu, i also use megahit and am getting the same error "... line 1260, in make_unclassified_seq_fasta fasta_dict[seq])) ... KeyError: ...." Did you find a solution? Thanks, Shea

lucianhu commented 1 month ago

Hi Shea,

No, I haven't found a solution to the error yet.

Best,

Nhu

sheaster commented 1 month ago

I'm trying the latest version 6.0.1 and will let you know if that works for me.

lucianhu commented 1 month ago

Thank you very much. I posted my question on Biostars, but no one replied. Have a good day, Nhu

sheaster commented 1 month ago

v6.0.1 still has this error.

lucianhu commented 1 month ago

Thank you for your message. In my case, I used Kraken2 to classify reads, but it's actually not ideal because the classification percentage is very low.

thauptfeld commented 4 weeks ago

Hi there, sorry about the wait. From what the error is, I don't think there is an issue with the contigs or the assembly. There seems to be something going wrong between the read names in the read files and the read names in the sorted bam file. Could you post the first 10 lines of one of the read files and the first 10 lines of the corresponding sorted bam file (output of samtools view)?

lucianhu commented 4 weeks ago

Hello Thauptfeld,

Thank you for your response.

Here is my BAM file,

FT100012261L1C012R00301420211   99  k127_43926  1   60  150M    =   62  211 GCCGATCCGCTCGAGCCCGCGCCCAAAGGCGCCGCCGGGCAACTCGCCGAAGCGCACCAGCACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAA  IIIIEDIIIIDIIEIIIIIIIIIIEEEIIIIIIIIIIIIIEEIDIIIIIEEIIIIEIIEIIEIIEIIIEIIIDIIICDIIIIIIIEIIIIICIIIIIDDIDIIEEIIEIIIICDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEGIECIEE  NM:i:0  MD:Z:150    MC:Z:150M   MQ:i:60 AS:i:150    XS:i:0
FT100012261L1C012R00301420211   147 k127_43926  62  60  150M    =   1   -211    ACGACCGAGCGTCCGTTGCGCCCGAGCCGCTCGCGCTTCTCGAACCACCGGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGG  CIHCIIICIIICIIIDEEIIIGI?CIIIIIDIIIIIEEIDIICBIICIIIIEEGIIIIIIBCIIIICEIDCICIICIICIICIICDICCIDIIIIIDIICIIIIICIIDIICEIDIIDIEIIIDIIDIICIIEIIIIIDIIIIIEEIDII  NM:i:0  MD:Z:150    MC:Z:150M   MQ:i:60 AS:i:150    XS:i:0
FT100012261L1C006R00300498314   99  k127_43926  72  40  127S19M4S   =   72  19  GCCCCCCCTGTCTTCGGCTTCCTCGGCGCCTTCCTTCTTGATGTGACCCTCGACGTCGAGCTCCTCGTTGTTCAGCAGGTTGCTCGATAAGTTGAGCAGCTCATCCGACATGTCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCA  IIIIIIIIDICIDDIIBICD@IDIIIIDIIDDIIDDIDD@DCICIBIIICIF7IFCIIEIICIICIHCCCCCICHIBIICDI=CI=6C8DGCCCEAIEIIDIDCII6DIDCICICIICIICCII?IDHAIIICC:IIIII)?@IIGAII?  NM:i:0  MD:Z:19 MC:Z:15S19M116S MQ:i:40 AS:i:19 XS:i:0
FT100012261L1C006R00300498314   147 k127_43926  72  40  15S19M116S  =   72  -19 TCTCCTCCTTCGGGAGTCCGTTGCGCCCGAGCCGTGCAGTTACTCCTCCAGACGTGGTGGAGGAGCGAAAGCTTCCCGGCATTGTCTCGCTCTCCGAGAAGCGATTCAATCCCTCGATGAGCCGGTTCCGTGTGCTGGCGTCGTCCCTGC  DI?HIDGIBEIIIICIBIIIEEIIIIIIICIIHIEIIDIEECIDIIDIIDICIIEIIEIIDIIDIIIDDCIIDEIIIIIICEEIEIEIIIDIEIIIDIDCIIIDEEIDDEIIIEIIDDIDIIIIIEEIIIEIEIIEIIIIEIIEIIIDII  NM:i:0  MD:Z:19 MC:Z:127S19M4S  MQ:i:40 AS:i:19 XS:i:0
FT100012261L1C003R00101959697   163 k127_43926  111 60  150M    =   165 204 GGTTGCCGGCGAACGCCATCAACAGCACGACGACGATCAACAGCGGGAGCAGCGCGACCAGCATCACGAGTGCGAGGAGGACGTCGCCCAGGCGCTTGAGGAGCAGCGAAACGCGGGCACGCGGGCGCCCGAGCCGTTGCGCGGCTCGTG  IIDDIIIIIIIEEIIIIEDIEEIEIIEIIEIIEIIECIEEIEIIIIIEIIEIIIIIEIIEIIECIEIIEICIIIEIIDIIDIICIIIIIEIIIIICCIDIIEIIEIIIEEEIIIIIIIEIIIIIIIIIIIIDIIIICCIIIIIIICIICI  NM:i:0  MD:Z:150    MC:Z:150M   MQ:i:60 AS:i:150    XS:i:0

and this is from the read files.

@FT100012261L1C001R00100000393/1
TTTGCCGGACAACCAGCCGCCGCCCAAGGGGGACCAAGGCAGCAGCCCGATCCCCGCATCGAGCGCCGCCGGGACGATCTCGAATTCGATGTCCCGGACAAGAAGGCTGTACTGCGGCTGCAGCCTCACCGGCGGGTTCCAGCCTTGGCT
+
CCCIIIF<6I@EII;;II=IIIIII>>IB?II.IIE:I:I@DI><IIIIC=IIIIII:CII5=I6IIIIIIIIDIH,BICA>E0:?II*BICIII=.@IE-GD;AI05(=5ICII0IIC/I<I)IBI-I0@FI.FIB=II,-II*C6II;
@FT100012261L1C001R00100000429/1
TCCAGCGGCACGCCTAGCTCCTGCGCGACCGTTTCCAACGACCGGGGCGTGCTCAGAATGTTTTTACGCTCGAAATTGCCCCGCTCGGTCACGTCGAAGTAGCGCAGCACCAGTGGTGCGTCGTCGGGTCCTACCACGGCAGTAATCTCG
+
CIIEDII<IDIIIICEIICIICIIIIIDIIICC<IIDEIFDIIIIIII<C;IBIEIE@CH5BCCCEIIICIHEDDCCIIIIIIHCI9IAIEIHBIHDBIC8IIHIDHIEIIEF<IACHII+IC7IA-IBIIC<IIEII/ID-CD@CICIA
@FT100012261L1C001R00100000695/1
CGAGGAAGAGGCCCGCGACCAGATCGAGACCAACCTGTTCGGCGCCCTGTGGATCACCCAGGCTGCGCTGCCCATCCTGCGTGAGCAGGGGAGCGGTCACATCATCCAGGTGTCGTCGATCGGGGGCATCTCCGCGTTCCCGCGCGTCGG

I look forward to hearing from you.

Best,

Nhu

sheaster commented 4 weeks ago

Hi All, yesterday, I am pretty sure I was able to get RAT running with a similar bam file by changing "fasta_dict[seq]" to "fasta_dict[seq_id]" in line 1259 of reads.py: outf.write('>{0}{1}\n{2}\n'.format(seq_id, suffix, fasta_dict[seq]))

Cheers, Shea

lucianhu commented 4 weeks ago

Hello Thauptfeld,

Thank you for considering my feedback. Yesterday, I tried your updated code, but there are still some issues.

[M::mem_process_seqs] Processed 628586 reads in 129.108 CPU sec, 5.694 real sec
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 /path/to/Phase_1_1.contigs.fa ./Phase_1_1_1.fastq ./Phase_1_1_2.fastq
[main] Real time: 114.846 sec; CPU: 2686.158 sec
[2024-11-02 10:05:31] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-11-02 10:05:39] Read mapping done!

[2024-11-02 10:05:39] contig2classification file supplied. Processing contig classifications.
[2024-11-02 10:05:39] Loading file /path/to/20240422_CAT_nr/tax/nodes.dmp.
[2024-11-02 10:05:41] Processing mapping file(s).

[2024-11-02 10:06:10] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-11-02 10:06:10] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-11-02 10:06:11] Contigs written! Appending forward reads...
Traceback (most recent call last):
  File "/path/to/CAT_pack", line 101, in <module>
    main()
    ~~~~^^
  File "/path/to/CAT_pack", line 85, in main
    reads.run()
    ~~~~~~~~~^^
  File "/path/to/CAT_pack/reads.py", line 435, in run
    make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                                uncl_unm_fasta, 'fastq', 'a','_1')
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/CAT_pack/reads.py", line 1260, in make_unclassified_seq_fasta
    fasta_dict[seq]))
    ~~~~~~~~~~^^^^^
KeyError: 'FT100012261L1C009R00400078634'

Sincerely,

Nhu

thauptfeld commented 3 weeks ago

Hi Nhu,

the issue with the run seems to be that in the read files, the read id's have a /1 at the end (or a /2 in the reverse file I assume), which is removed during the mapping. So, in the bam file they don't have the /1 at the end. Therefore, RAT is looking for a read id that doesn't exist in the read file. RAT is supposed to be able to handle this for '/1' and '_1', so I'll have to check what is going on.

Will let you know.