borenstein-lab / MetaLAFFA

A bioinformatics pipeline for annotating functional capacities in shotgun metagenomic data with native compute cluster integration
GNU General Public License v3.0

Missing $CONDA_PREFIX/MetaLAFFA/config/steps? #7

Open padbc opened 3 years ago

padbc commented 3 years ago

Dear all,

I am trying to modify the cluster_params parameter, but I can't seem to find the map_read_to_genes.py script -- in fact, the entire config folder is missing from the conda installation. I installed MetaLAFFA using mamba, by the way. I would appreciate any input you may have.

engal commented 3 years ago

Just to confirm: you're trying to modify the default parameters for your MetaLAFFA installation by editing the base config folder? If so, then when you activate your MetaLAFFA environment and try to locate the base config folder at

$CONDA_PREFIX/lib/python3.6/config

you can't find it, correct?

padbc commented 3 years ago

Thank you for your quick reply. That's indeed the case, and I can't find the MetaLAFFA/config/steps/ folder, either. (In my case, the full path is /home/{username}/miniconda3/envs/metalaffa/MetaLAFFA.) Incidentally, my uniref90_to_ko.map file is also empty, but I don't know the extent to which the two problems are related.

engal commented 3 years ago

To clarify, there should not be a MetaLAFFA/config folder in that environment. Instead, the config folder should be located at /home/{username}/miniconda3/envs/metalaffa/lib/python3.6/config.
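If it helps, here's a quick way to confirm what's there (a minimal sketch; it assumes the metalaffa environment is activated so that CONDA_PREFIX is set, and the steps subfolder should show up in the listing):

```python
import os

# With the metalaffa environment activated, CONDA_PREFIX points at the
# environment root (e.g. /home/{username}/miniconda3/envs/metalaffa).
config_dir = os.path.join(os.environ["CONDA_PREFIX"], "lib", "python3.6", "config")

if os.path.isdir(config_dir):
    # The config submodules (e.g. operation.py, file_organization.py)
    # and the steps subfolder should appear in this listing.
    for name in sorted(os.listdir(config_dir)):
        print(name)
else:
    print("config folder not found at", config_dir)
```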

Regarding an empty uniref90_to_ko.map file, do you have the database preparation output and error logs (should be located in the same directory as the uniref90_to_ko.map file)?

padbc commented 3 years ago

Brilliant, thank you -- I found the config folder.

As to your second point, neither the logs nor the database preparation outputs can be found in the gene_to_ortholog_maps/ folder. However, the prepare_databases.py script finishes without any obvious errors:

Beginning processing of reference data.

Downloading human reference.
Creating human reference Bowtie 2 index.

Downloading ortholog-to-grouping mappings.

Downloading UniProt UniRef90 database.
Creating UniRef90 DIAMOND database.
Creating UniRef90 gene length table.
Downloading UniRef90 gene-to-ortholog mapping.
Formatting UniRef90 gene-to-ortholog mapping.

********************************************************************************
SETUP COMPLETE
Automated reference data processing has finished. Please see above for details on supporting data files that could not be generated.

End of database preparation.
engal commented 3 years ago

Given that the expected file is present but empty, it seems possible that the subprocess generating the mapping file is running out of memory and terminating without raising an error that alerts Python. In this case, you can try rerunning the prepare_databases.py script with additional allocated memory. Assuming the other expected data files have been prepared correctly, you can either delete the empty mapping file and rerun prepare_databases.py, or run the script with just the "-u" and "-f" flags to force it to prepare the UniProt-associated databases again.
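As a concrete example, the delete-and-rerun approach could look like this (a minimal sketch; the path assumes the default installation location, and prepare_databases.py is assumed to be on your PATH once the environment is activated):

```python
import os
import subprocess

# Default location of the mapping file (adjust if you've changed
# config.file_organization).
map_file = os.path.expanduser(
    "~/miniconda3/envs/metalaffa/MetaLAFFA/gene_to_ortholog_maps/uniref90_to_ko.map"
)

# Remove the empty file so that the preparation script regenerates it.
if os.path.isfile(map_file) and os.path.getsize(map_file) == 0:
    os.remove(map_file)

# Rerun only the UniProt-associated database preparation; -f forces
# regeneration of those databases.
subprocess.run(["prepare_databases.py", "-u", "-f"], check=True)
```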

I also have a few questions to hopefully narrow down what is causing this issue if the above does not work:

  1. Since there was some confusion regarding the location of the config folder, I just want to confirm that the empty file is in the default location, /home/{username}/miniconda3/envs/metalaffa/MetaLAFFA/gene_to_ortholog_maps/uniref90_to_ko.map?
  2. Have you modified any config parameters?
  3. Does the data file used to generate the gene-to-ortholog mapping exist (should be located at /home/{username}/miniconda3/envs/metalaffa/MetaLAFFA/databases/idmapping.dat.gz)?
  4. How are you running prepare_databases.py (e.g. directly in a terminal session, in a screen session, via a job on a cluster, etc.)?
padbc commented 3 years ago

Thank you!

"you can try rerunning the prepare_databases.py script with additional allocated memory". How should one do this? I re-ran "prepare_databases.py" after deleting the map file, but got this message:

Warning: The gene-to-ortholog table (/home/pedro/miniconda3/envs/metalaffa/MetaLAFFA/gene_to_ortholog_maps/uniref90_to_ko.map) does not exist. You will be unable to perform the default ortholog mapping step (and the hit filtering step if using either the 'best_ortholog' or 'best_n_orthologs' methods) without it. Make sure that 'gene_to_ortholog_directory' in config.file_organization and 'gene_to_ortholog_file' in config.operation.py are correct.
End of database preparation.

To answer your questions:

  1. Yes.
  2. No. I ran prepare_databases as described in the tutorial.
  3. Yes, I have the idmapping.dat.gz file.
  4. I am running prepare_databases directly in terminal.
engal commented 3 years ago

Did you run prepare_databases.py with any options? You'll at least need the "-u" option so it knows to try preparing the UniProt-associated databases. By default, it will skip over any that already exist, so by deleting the incorrect map file, it should only take the time to try regenerating that file.

Based on your responses, it seems most likely that it's a memory issue.

padbc commented 3 years ago

Thank you! As I said, the first time I followed the tutorial guidelines, i.e. ran it with the -hr -km -u flags. The second run, which is still in progress, uses only the -u flag.

If it's a memory issue, I'm not seeing it: the process never uses more than 10% of available memory. How can I troubleshoot this? Thanks again.

engal commented 3 years ago

Sorry, if you used the -u flag, did you get any additional output beyond

Warning: The gene-to-ortholog table (/home/pedro/miniconda3/envs/metalaffa/MetaLAFFA/gene_to_ortholog_maps/uniref90_to_ko.map) does not exist. You will be unable to perform the default ortholog mapping step (and the hit filtering step if using either the 'best_ortholog' or 'best_n_orthologs' methods) without it. Make sure that 'gene_to_ortholog_directory' in config.file_organization and 'gene_to_ortholog_file' in config.operation.py are correct.
End of database preparation.

There should be more output, including messages saying that it is performing certain steps and/or skipping unnecessary ones. Did you get any other messages, or just that warning?

padbc commented 3 years ago

I was able to rerun prepare_databases.py with the -u flag, and it completed with no messages or warnings. However, I again found an empty uniref90_to_ko.map file. The issue does not seem to be caused by a lack of memory, then. It seems to me that the problem may stem from the current version of idmapping: could it be lacking KO definitions?

For now, I'll run MetaLAFFA without the uniref90_to_ko.map file. What are the possible ortholog types of the create_uniref_gene_to_ortholog.py script?

engal commented 3 years ago

You're correct, it looks like they removed KO mappings from the current version of the file on UniProt. They still have mappings for organism-specific genes in the KEGG database, but not ortholog classifications. This means that the later steps of MetaLAFFA (ortholog abundance correction and ortholog aggregation) will no longer work under the default configuration, since the default approaches rely on KO IDs.

Regarding other ortholog types, you can map to any database that appears in the second column of the idmapping.dat.gz file -- for example, RefSeq, eggNOG, BioCyc, etc. To create the mapping table, first go to the operation.py submodule in the config module and change target_ortholog to your new choice (line 95). Then rerun the database preparation script, and that should generate the new mapping file.
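If you want to see which databases are available in your copy before picking one, you can tally the second column directly (a minimal sketch; adjust the path for your username, and note that scanning the full ~20 GB file takes a while):

```python
import gzip
from collections import Counter

# idmapping.dat.gz has three tab-separated columns:
# UniProt accession, database name, database-specific ID.
idmapping = "/home/{username}/miniconda3/envs/metalaffa/MetaLAFFA/databases/idmapping.dat.gz"

counts = Counter()
with gzip.open(idmapping, "rt") as f:
    for line in f:
        counts[line.split("\t")[1]] += 1

# Any database name printed here is a candidate value for target_ortholog.
for db, n in counts.most_common():
    print(db, n)
```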

padbc commented 3 years ago

Great -- thank you. Would it be possible to post/share the uniref90_to_ko.map file from previous versions? And just to make sure, the uniref90_to_ko.map file should have the following format:

Q6GZS4  K12408
Q92AT0  K21298
P48347  K06630
...     ...

without column headers. Or is it the other way around -- KO numbers go in the first column and UniProt IDs go in the second?

Thanks again for your help.

engal commented 3 years ago

The format for the mapping file is ortholog first, then sequence database ID, e.g. KO first, then UniProt ID. There are no column headers.
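If you want to sanity-check a mapping file against that format, something like this works (a minimal sketch; the "K" plus five digits pattern for KO IDs is an assumption based on the examples above, and the columns are assumed to be tab-separated):

```python
import re

KO_ID = re.compile(r"^K\d{5}$")  # e.g. K12408, per the examples above

def check_map_format(path, n_lines=5):
    """Print the first few lines, flagging any that don't look like
    'KO<tab>UniProt ID' with no header row."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= n_lines:
                break
            fields = line.rstrip("\n").split("\t")
            ok = len(fields) == 2 and bool(KO_ID.match(fields[0]))
            print("OK " if ok else "BAD", line.rstrip("\n"))

check_map_format(
    "/home/{username}/miniconda3/envs/metalaffa/MetaLAFFA/gene_to_ortholog_maps/uniref90_to_ko.map"
)
```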

Regarding posting or sharing the file, the compressed version of the file is still larger than GitHub's 25 MB size limit for attachments, and for size reasons we would not want to add it to the repository. I can look into other ways to potentially host the file.

It may be worth contacting UniProt to see if there is a reason the KO annotations were removed. Ideally, MetaLAFFA would rely on the data available from UniProt, meaning that users could easily keep both the sequence database and the mapping file up-to-date, and more importantly, synced. It's possible that using an older version of the mapping file with a newer version of the sequence database could lead to issues.

padbc commented 3 years ago

Thank you, @engal. Since the main MetaLAFFA approach is predicated upon the existence of KO annotations, would it make sense to make this clear on the software's landing page, and potentially suggest an alternative?

With regard to the UniProt-KO cross-references, I will contact UniProt to see what's happening. While using an outdated UniProt-to-KO mapping is not ideal, I wonder how big an impact this would have, considering that KOs are aggregated into pathways/modules.

engal commented 3 years ago

Thanks for the suggestion. I've also contacted UniProt to ask them about this, and I'm currently looking into whether accessing data from previous UniProt releases is an option. You may be able to use this option to solve your current issue (you can browse previous UniProt releases here); however, it looks like previous releases have the entire dataset compressed as a single archive (located under previous_releases/release-XXXX_XX/knowledgebase/knowledgebaseXXXX_XX.tar.gz). You can try downloading one of these archives and extracting the idmapping.dat file for your MetaLAFFA installation. I am not currently making this the default for MetaLAFFA installation because of the large size of such archives (~130 GB, versus ~20 GB for the idmapping.dat.gz file alone). However, if the response from UniProt indicates that future releases will not have KO mappings, then I will switch to using historical UniProt releases as the default.
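For reference, pulling just the idmapping file out of one of those archives could look like this (a minimal sketch; the archive filename below is a placeholder following the pattern above, and the exact member path inside the archive is an assumption -- check the archive listing first):

```python
import tarfile

# Placeholder archive name, following the
# previous_releases/release-XXXX_XX/knowledgebase/knowledgebaseXXXX_XX.tar.gz
# pattern described above.
archive = "knowledgebaseXXXX_XX.tar.gz"

# Stream through the archive and extract only the idmapping file, so the
# full ~130 GB of contents never gets unpacked.
with tarfile.open(archive, "r:gz") as tar:
    for member in tar:
        if "idmapping.dat" in member.name:
            tar.extract(member, path=".")
            break
```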

engal commented 3 years ago

Quick update, I heard back from UniProt and it is indeed a KEGG licensing issue that led them to remove their KO mappings. I'm working on an update to MetaLAFFA that will switch to using the last available version of UniProt where KO mappings were available. I'll update this issue once the fix is done.

padbc commented 3 years ago

Thank you -- that sounds great. (Incidentally, the latest idmapping.dat file does not include Gene Ontology terms, which are also amenable to aggregation into higher-level functional categories.)

engal commented 2 years ago

Sorry for the delay, but I've now modified the default prepare_databases.py script to instead use the most recent version of the UniProt database that also includes KO mappings (from 2020). Now KO mappings should always be available when using the default installation, assuming UniProt does not modify their archives.

These changes are in the most recent MetaLAFFA release (1.0.1). Please let me know if you have any further issues.

padbc commented 2 years ago

Thank you for letting me know. I will test this shortly. How recently has this new version been tested? The https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/ site now appears to be empty.

engal commented 2 years ago

I'm not sure why you're unable to see the contents. When I first clicked the address you posted, nothing showed up for me either, but after refreshing, the page loaded fine. This component of the package was tested last week, so unless UniProt has made any changes since then, it should work. wget can also still access the contents that should be there.
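For example, a quick way to confirm the listing is reachable without a browser (a minimal sketch using only the Python standard library):

```python
import urllib.request

URL = "https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/"

# Fetch the directory listing; an HTTP 200 plus some HTML in the body means
# the listing is there even if a browser fails to render it on first load.
with urllib.request.urlopen(URL, timeout=30) as resp:
    print(resp.status)
    print(resp.read(500).decode("utf-8", errors="replace"))
```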