MGXlab / CAT_pack

CAT/BAT/RAT: tools for taxonomic classification of contigs and metagenome-assembled genomes (MAGs) and for taxonomic profiling of metagenomes
MIT License

Suggestions for better documentation on RAT inputs #131

Open aoliver44 opened 2 months ago

aoliver44 commented 2 months ago

Thank you for such a fantastic tool.

I have been trying to get RAT to work. Your documentation specifies "Currently, RAT supports single read files as well as paired-end read files. Interlaced read files are currently not supported."

  1. Would you consider clarifying how to use RAT with single read files? I would assume something like:

reads --mode cr -c 5002.filtered.fasta -1 reads_to_map.fastq -d $DB -t $TAX --c2c 5002_CAT.contig2classification.txt

But that throws an error:

File "[...]", line 442, in run
    make_unclassified_seq_fasta(reads_files[1], unmapped_reads['rev'],
IndexError: list index out of range

I'm terrible at Python, but from looking at the code, I could not see how the -2 flag could be optional (i.e. in the case of single read files).

For troubleshooting, I tried supplying an empty file as -2 reads_to_map_2.fastq, which seemed to get it to run further. I think this is because the process looks like it appends the reads and unclassified contigs all to one big file, so the pairing doesn't really matter?
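The traceback above shows reads_files[1] being indexed unconditionally, so a run with only -1 has nothing at that position. Below is a minimal sketch of the kind of guard that would make -2 optional. It assumes reads_files is a list holding one or two paths and unmapped_reads is a dict keyed by 'fw' and 'rev', as the traceback suggests. The helper name iter_read_inputs and the '_2' suffix are hypothetical and not taken from CAT_pack.

```python
def iter_read_inputs(reads_files, unmapped_reads):
    """Yield (path, unmapped_ids, suffix) for each read file that was supplied.

    Hypothetical helper, not part of CAT_pack. zip() stops at the shorter
    input, so with a single read file the reverse entry is never touched.
    """
    labels = [("fw", "_1"), ("rev", "_2")]  # '_2' is an assumed suffix
    for path, (key, suffix) in zip(reads_files, labels):
        yield path, unmapped_reads.get(key, set()), suffix


if __name__ == "__main__":
    # Single-end input: only the forward file is visited, no IndexError.
    for item in iter_read_inputs(["reads_to_map.fastq"], {"fw": {"read1"}}):
        print(item)
```

Each yielded tuple could then be passed to whatever appends the unmapped reads to the combined FASTA, once per file that actually exists.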

  2. Would you consider clarifying whether there are requirements for the read files? Do they have to be FASTQ? Line ~435 appears to hard-code 'fastq' in the call to make_unclassified_seq_fasta(). Again, I am very weak with Python, so I might be wrong. If they do have to be FASTQ, it might be helpful to highlight that requirement in the documentation and the help text.
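Until the requirement is documented, a quick check of the read file before launching a long run can save time. This is a minimal sketch that only looks at the first non-empty line of an uncompressed text file. The function name sniff_read_format is hypothetical, and whether RAT accepts anything other than FASTQ is exactly the open question here.

```python
def sniff_read_format(path):
    """Return 'fastq' or 'fasta' based on the first non-empty line.

    Hypothetical pre-flight check, not part of CAT_pack. FASTQ records
    start with '@' and FASTA records start with '>'.
    """
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith("@"):
                return "fastq"
            if line.startswith(">"):
                return "fasta"
            break  # first real line is neither header type
    raise ValueError(f"{path}: not recognisable as FASTA or FASTQ")
```

Calling sniff_read_format("reads_to_map.fastq") before running RAT would flag a FASTA read file early, instead of failing partway through.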

lucianhu commented 1 month ago

Hello Andrew,

Have you successfully run the RAT tool yet? I encountered an error related to the input. From what I understand, my FASTQ headers must be present in the contigs. My contigs were generated by MEGAHIT, and they only have names like >k127_10000, without any FASTQ names. What do you think I should do?

Best regards,


./CAT_pack reads --mode cr -c contigs.fa -1 R_1.fastq -2 R_2.fastq -d /path/to/20240422_CAT_nr/db/ -t /path/to/200422_CAT_nr/tax/ -o lala -p out.CAT.predicted_proteins.faa -a out.CAT.alignment.diamond --c2c out.CAT.contig2classification.txt

<frozen genericpath>:39: RuntimeWarning: bool is used as a file descriptor
# CAT_pack v6.0.

[2024-10-11 09:41:47] samtools found: samtools 1.21.
[2024-10-11 09:41:47] bwa found: Version: 0.7.18-r1243-dirty.

RAT is running. Mapping reads against assembly with bwa mem.

[2024-10-11 09:41:47] Running bwa mem for read mapping. File fastq.bwamem.sorted will be generated.Do not forget to cite bwa mem and samtools when using RAT in your publication!
[2024-10-11 09:41:47] Contigs fasta is already indexed.
[2024-10-11 09:41:47] Running bwa mem...
[main] Version: 0.7.18-r1243-dirty
[main] CMD: bwa mem -t 24 
[main] Real time: 118.429 sec; CPU: 2701.163 sec
[2024-10-11 09:43:52] Sorting bam file...
[bam_sort_core] merging from 0 files and 24 in-memory blocks...
[2024-10-11 09:44:00] Read mapping done!

[2024-10-11 09:44:00] contig2classification file supplied. Processing contig classifications.
[2024-10-11 09:44:00] Loading file 20240422_CAT_nr/tax/nodes.dmp.
[2024-10-11 09:44:03] Processing mapping file(s).

[2024-10-11 09:44:32] Chosen mode: cr. Classifying unclassified contigs and unmapped reads with diamond if no classification file is supplied.
[2024-10-11 09:44:32] No unmapped2classification file supplied .Grabbing unmapped and unclassified sequences...
[2024-10-11 09:44:33] Contigs written! Appending forward reads...
Traceback (most recent call last):
  File "CAT_pack/CAT_pack/./CAT_pack", line 101, in <module>
  File "CAT_pack/CAT_pack/./CAT_pack", line 85, in main
  File "CAT_pack/CAT_pack/", line 435, in run
    make_unclassified_seq_fasta(reads_files[0], unmapped_reads['fw'],
                                uncl_unm_fasta, 'fastq', 'a','_1')
  File "CAT_pack/CAT_pack/", line 1260, in make_unclassified_seq_fasta
KeyError: 'FT100012261L1C013R00202177739'
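
The KeyError names a read ID rather than a contig name, which suggests the failed lookup was keyed on read headers and not on contig names. One way to narrow this down is to check how that ID actually appears in the read file. The sketch below assumes an uncompressed FASTQ with four lines per record and assumes the expected key is the first whitespace-delimited token of the header. Neither assumption is confirmed from the CAT_pack source, and both function names are hypothetical.

```python
def fastq_ids(path):
    """Yield the ID of each record: first token of the header, '@' stripped.

    Assumes strict 4-line FASTQ records (no wrapped sequence lines).
    """
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                tokens = line[1:].split()
                if tokens:  # ignore a stray blank line at the end
                    yield tokens[0]


def find_id_variants(path, wanted):
    """Return the header IDs that equal `wanted` or start with it.

    A result such as ['<wanted>/1'] would point to a header suffix that
    the mapped-read names do not carry.
    """
    return sorted({rid for rid in fastq_ids(path) if rid.startswith(wanted)})
```

For example, find_id_variants("R_1.fastq", "FT100012261L1C013R00202177739") returns the matching header IDs. An empty list would mean the read is not in that file at all, while a longer variant of the ID would suggest a header-format mismatch.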