MGXlab / CAT_pack

CAT/BAT/RAT: tools for taxonomic classification of contigs and metagenome-assembled genomes (MAGs) and for taxonomic profiling of metagenomes
MIT License

Error reading temporary file of Diamond #20

Closed MikaUhr closed 5 years ago

MikaUhr commented 5 years ago

Hi,

When running "CAT contigs" I get this error:

No such file or directory
Error: Error reading file /CAT_outout_directory/diamond-tmp-J1k18Q
terminate called after throwing an instance of 'File_read_exception'
what(): Error reading file /CAT_outout_directory/diamond-tmp-J1k18Q
[2019-04-03 17:03:41.921512] ERROR: Diamond finished abnormally.
[2019-04-03 17:04:51.988192] ERROR: input file /CAT_outout_directory/sample.CAT.ORF2LCA.txt does not exist.

And my script is:

#$ -pe smp 16
#$ -l h_vmem=8G

CAT contigs -n 16 -c ${Scaffold} -d ${Dir_DB} -t ${Dir_taxon} -o ${Dir_out}/${Sample} \
    --proteins_fasta ${Protein} \
    --path_to_prodigal /usr/.pyenv/shims/prodigal \
    --path_to_diamond /usr/.pyenv/shims/diamond

Versions of the dependencies are:

DIAMOND: 0.9.14
Prodigal: V2.6.3
Python 3: 3.6.8

Best, Mika

bastiaanvonmeijenfeldt commented 5 years ago

Hi Mika,

Could you send me the entire CAT log?

Bastiaan

MikaUhr commented 5 years ago

Hi Bastiaan,

Thank you for your help.

This is the entire CAT log.

# CAT v4.3.3.

CAT is running. Since a predicted protein fasta is supplied, only alignment and contig classification are carried out.
Rarw!

Supplied command: /home/usr/Analysis/scaf/scaffoldSeq.fasta -d /home/usr/CAT/CAT_prepare_20181212/2018-12-12_CAT_database -t /home/usr/CAT/CAT_prepare_20181212/2018-12-12_taxonomy -o /home/usr/Analysis/cat/sample --proteins_fasta /home/usr/Analysis/cat/sample.predicted_proteins.faa --path_to_prodigal /home/usr/.pyenv/shims/prodigal --path_to_diamond /home/usr/.pyenv/shims/diamond

Contigs fasta: /home/usr/Analysis/scaf/scaffoldSeq.fasta
Taxonomy folder: /home/usr/CAT/CAT_prepare_20181212/2018-12-12_taxonomy/
Database folder: /home/usr/CAT/CAT_prepare_20181212/2018-12-12_CAT_database/
Parameter r: 10
Parameter f: 0.5
Log file: 0.5

-----------------

Doing some pre-flight checks first.
[2019-04-03 18:12:02.154792] Diamond found: diamond version 0.9.14.
Ready to fly!

-----------------

[2019-04-03 18:12:02.457721] Importing contig names from /home/usr/Analysis/scaf/scaffoldSeq.fasta.
[2019-04-03 18:12:04.839788] Parsing ORF file /home/usr/Analysis/cat/sample.predicted_proteins.faa
[2019-04-03 18:12:07.458773] Homology search with Diamond is starting. Please be patient. Do not forget to cite Diamond when using CAT or BAT in your publication!
                query: /home/usr/Analysis/cat/sample.predicted_proteins.faa
                database: /home/usr/CAT/CAT_prepare_20181212/2018-12-12_CAT_database/2018-12-12.nr.dmnd
[2019-04-03 22:27:04.436939] ERROR: Diamond finished abnormally.

This problem seems to be the same as #15.

The reporter of #15 said this:

I've successfully increased the memory usage of Diamond search step by adding '--block-size' and '--index-chunks' options to 'shared.py' file.

But I can't find the 'shared.py' file.

Thanks, Mika

bastiaanvonmeijenfeldt commented 5 years ago

Hi Mika,

Since the error is in the Diamond alignment step, it's a little harder to debug this. If you use a small subset of your data, does CAT finish? You can for instance make a .fasta file that only contains the first 10 scaffolds. I suspect this will run but could you confirm this for me?
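For anyone following along, one way to make such a subset is sketched below in Python. The function name `first_n_fasta_records` and the file names in the usage comment are placeholders for illustration, not part of CAT. It assumes a plain-text FASTA file in which every record starts with a `>` header line.

```python
def first_n_fasta_records(in_path, out_path, n=10):
    """Copy the first n FASTA records from in_path to out_path.

    Returns the number of records written. Hypothetical helper, not part of CAT.
    """
    written = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                if written == n:
                    break  # the next header would start record n + 1
                written += 1
            if written:  # ignore anything before the first header
                dst.write(line)
    return written


# Example usage (placeholder file names):
# first_n_fasta_records("scaffoldSeq.fasta", "first10.fasta", n=10)
```

The resulting file can then be passed to CAT contigs with -c in place of the full scaffold file.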

If the problem is indeed similar to #15, your machine has run out of disk space. I would have expected a different error message from Diamond in that case, but let's test this as well! You can check available disk space in Linux with the command df -h. The temporary solution mentioned in #15 is to change the shared.py file within the CAT directory: CAT_pack/shared.py. Lines 79 to 88 of the code generate and call the Diamond command:

        subprocess.check_call([path_to_diamond,
                               'blastp',
                               '-d', diamond_database,
                               '-q', predicted_proteins_fasta,
                               '--top', '50',
                               '--matrix', 'BLOSUM62',
                               '--evalue', '0.001',
                               '-o', diamond_file,
                               '-p', str(nproc),
                               '--quiet'])

This would translate into a shell command like this: diamond blastp -d nr.dmnd -q predicted_proteins.fasta --top 50 --matrix BLOSUM62 --evalue 0.001 -o out.alignment.diamond -p 24 --quiet. If you, for instance, remove the '--quiet' in line 88, Diamond will be much more vocal. As another example, you can add '--block-size', '1', '--index-chunks', '1', to the list to get temporary disk space down.
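To make the two tweaks above concrete, here is a minimal sketch of how the argument list could be assembled with the extra options switched on or off. This is not the code in shared.py: the function `build_diamond_command` and its `block_size`, `index_chunks` and `quiet` parameters are made up for illustration. Only the Diamond options themselves are taken from the snippet and the suggestion above.

```python
def build_diamond_command(path_to_diamond, diamond_database,
                          predicted_proteins_fasta, diamond_file, nproc,
                          block_size=None, index_chunks=None, quiet=True):
    """Return the Diamond blastp argument list (hypothetical helper)."""
    command = [path_to_diamond,
               'blastp',
               '-d', diamond_database,
               '-q', predicted_proteins_fasta,
               '--top', '50',
               '--matrix', 'BLOSUM62',
               '--evalue', '0.001',
               '-o', diamond_file,
               '-p', str(nproc)]
    if block_size is not None:
        command += ['--block-size', str(block_size)]
    if index_chunks is not None:
        command += ['--index-chunks', str(index_chunks)]
    if quiet:
        # Leave this out to make Diamond report its progress.
        command.append('--quiet')
    return command


# Print the command instead of running it, so it can be inspected first.
print(' '.join(build_diamond_command('diamond', 'nr.dmnd',
                                     'predicted_proteins.fasta',
                                     'out.alignment.diamond', 24,
                                     block_size=1, index_chunks=1,
                                     quiet=False)))
```

The returned list can be handed to subprocess.check_call in the same way as the original list.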

Having said all this (for reference), I will try to put new code online soon (within days) that will add these Diamond options. So you can either do this yourself or wait a little longer...

Let me know if this helps!

Bastiaan

bastiaanvonmeijenfeldt commented 5 years ago

Hi Mika,

I have just published a new release (v4.4) where you can tune these DIAMOND-specific parameters! Let me know if that solves your issue!

Bastiaan

MikaUhr commented 5 years ago

Hi Bastiaan,

Thank you for your advice and the release of the new version!

I tried the following:

(1) Use a small subset of the data. The run on the file containing the 10 long scaffolds (5.8 M) stopped, but the run on the 10 short scaffolds (1.1 K) worked well.

(2) Use the new version (v4.4). I got this error:

Traceback (most recent call last):
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/CAT", line 72, in <module>
    main()
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/CAT", line 56, in main
    contigs.run()
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/contigs.py", line 574, in run
    contigs(args)
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/contigs.py", line 236, in contigs
    tmpdir) = check.convert_arguments(args)
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/check.py", line 46, in convert_arguments
    tmpdir = out_prefix.rsplit('/', 1)[0]
NameError: name 'out_prefix' is not defined

My script is:

#$ -pe smp 16
#$ -l h_vmem=4G

CAT contigs -n ${Thread} -c ${Scaffold} -d ${Dir_DB} -t ${Dir_taxon} -o ${Dir_out}/${Sample} \
    --proteins_fasta ${Protein} \
    --path_to_prodigal /home/usr/.pyenv/shims/prodigal \
    --path_to_diamond /home/usr/.pyenv/shims/diamond \
    --sensitive \
    --block_size 1 \
    --nproc ${Thread}

I installed CAT v4.4 and supplied the absolute path.

Thanks, Mika

bastiaanvonmeijenfeldt commented 5 years ago

Whoops, that error is on me, I'm sorry! I have pulled yesterday's release and put one with a bugfix online (still named v4.4, just download the latest). Could you try again with this version?

Also, what kind of machine are you running this on? Am I correct that you are using 16 cores and 4GB of RAM? How much free disk space do you have?
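For the disk-space question, df -h is the quickest check. The same number can also be read from Python with the standard library, as in the rough sketch below. The helper name `free_space_gib` is made up for illustration. The error message at the top of this thread shows the diamond-tmp file inside the output directory, so that is the file system worth checking.

```python
import shutil


def free_space_gib(path="."):
    """Return the free space, in GiB, on the file system that holds path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1024 ** 3


# Replace "." with the CAT output directory.
print(f"Free space: {free_space_gib('.'):.1f} GiB")
```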

Sorry for yesterday's bug, it's already confusing enough with one bug. :)

Best,

Bastiaan

MikaUhr commented 5 years ago

Dear Bastiaan,

Thank you for your reply.

I tried today's version but got this error:

Traceback (most recent call last):
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/CAT", line 72, in <module>
    main()
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/CAT", line 56, in main
    contigs.run()
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/contigs.py", line 574, in run
    contigs(args)
  File "/home/usr/tools/CAT/CAT-master/CAT_pack/contigs.py", line 394, in contigs
    if not check.check_fasta_file(predicted_proteins_fasta):
AttributeError: module 'check' has no attribute 'check_fasta_file'

"CAT -v" returns: CAT v4.4 (April 9, 2019) by F. A. Bastiaan von Meijenfeldt.

Also, what kind of machine are you running this on? Am I correct that you are using 16 cores and 4GB of RAM? How much free disk space do you have?

-> I'm running on Linux. I can use 32 core CPU and 187.6G RAM.

Best, Mika

bastiaanvonmeijenfeldt commented 5 years ago

Ah I see I was a little too eager to push the fix to you without running the code at least once. I have now done that and the latest version runs on our systems (again just download the latest release, v4.4).

We have run CAT on systems with far fewer cores and much less RAM, so that shouldn't be an issue. So let's see if setting --block_size to 1 will make the difference! If that does not work out we will start debugging DIAMOND.

Running DIAMOND in sensitive mode will usually take a lot longer by the way, and we generally do not notice great improvements in performance. Do let us know if it makes a difference for you!

Best wishes,

Bastiaan

MikaUhr commented 5 years ago

Thank you for your help.

The latest version now runs on my system. However, although I tried several combinations of options, such as --block_size set to 1 and to 0.2 and sensitive mode, I received the same error as before. I'll try the latest version of DIAMOND.

I have two questions:

1) I'll build the database files myself so that I can use the latest version of DIAMOND (because CAT_prepare_20190108.tar.gz was generated with DIAMOND v0.9.14). Where are the default database files on tbb.bio.uu.nl/bastiaan/CAT_prepare/ downloaded from? Are they the RefSeq bacterial complete genomes? Please tell me how to download the NCBI taxonomy files and nr database files.

2) I ran CAT using a small scaffold file. After I got a named CAT classification file, "CAT summarise" gave this error:

[2019-04-10 21:46:53.353757] ERROR: /home/proj/usr/cat/Sample.ORF2LCA_names.txt is not a CAT classification file.

My script is:

CAT add_names -i ${Dir_out}/${Sample}.ORF2LCA.txt -o ${Dir_out}/${Sample}.ORF2LCA_names.txt -t ${Dir_taxon}
CAT summarise -c ${Scaffold} -i ${Dir_out}/${Sample}.ORF2LCA_names.txt -o ${Dir_out}/${Sample}.summary.txt

The contents of "${Dir_out}/${Sample}.ORF2LCA_names.txt" is:

# ORF	lineage	bit-score	full lineage names
contig1_1	1;131567;2	135.2	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)
contig2_1	ORF has no hit to database.
contig3_1	1;131567;2;1783270;68336;976;200643;171549	135.6	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)	FCB group (no rank)	Bacteroidetes/Chlorobi group (no rank)	Bacteroidetes (phylum)	Bacteroidia (class)	Bacteroidales (order)
contig4_1	1;131567;2	116.7	root (no rank)	cellular organisms (no rank)	Bacteria (superkingdom)
contig5_1	ORF has no hit to database.
contig5_2	ORF has no hit to database.

Best, Mika

bastiaanvonmeijenfeldt commented 5 years ago

Hi Mika,

Could you send me your contigs fasta by email? You can find my email on my webpage. Then I'll debug this error myself.

1) The files on tbb.bio.uu.nl are generated in exactly the same way as when you would run ./CAT prepare --fresh. Inside those database files is a log file as well, where you can see where all the downloads come from. If you want to make the database files yourself I would suggest just running ./CAT prepare --fresh. The version of DIAMOND is not that important, as long as you run the same version for contig classification.

2) You can currently not run CAT summarise on an ORF2LCA file, only on the contig2classification file. That's why CAT gives you this error. Moreover, you can only summarise a file that was named with official ranks only, i.e. with the --only_official flag of add_names enabled. We're still evaluating what the summarise mode should do depending on how people use it. I for one think a mode that gives you back a community profile (based on read mappings) would be useful. Summarise now only gives you a profile per sequence length. So we are thinking about this. :)
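As a small aside on point 2: if you want to guard a script against passing the wrong file to summarise, a check on the header line is one option. The sketch below is not how CAT validates its input. The helper name `looks_like_orf2lca` is made up, and the only thing it relies on is the "# ORF" header visible in the file contents posted above.

```python
def looks_like_orf2lca(path):
    """Return True if the first line of path looks like an ORF2LCA header.

    Hypothetical helper: it only tests for the "# ORF" prefix seen in the
    file contents posted in this thread.
    """
    with open(path) as handle:
        header = handle.readline()
    return header.startswith("# ORF")


# Example usage (placeholder file name):
# if looks_like_orf2lca("Sample.ORF2LCA_names.txt"):
#     print("This is an ORF2LCA file, not a contig classification file.")
```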

So if you send me your contig file I'll fix this error and report back here for future reference.

Best wishes,

Bastiaan

bastiaanvonmeijenfeldt commented 5 years ago

I am closing this issue now, as the contigs could be classified on our systems.