biosustain / memote-meta-study

Test metabolic models in the wild and rank them according to popularity and how easily they can be improved.
Apache License 2.0

Issues with docker #23

Closed: franciscozorrilla closed this issue 4 years ago

franciscozorrilla commented 4 years ago

Hi,

I am trying to re-run the memote meta-analysis you so kindly and reproducibly shared. However, when I run make plot on the cluster, I run into problems with the docker commands:

$ make plot
find . -type f -name "*.py[co]" -delete
find . -type d -name "__pycache__" -delete
rm -rf supplements_cache/*
rm -rf supplements_files/*
jupyter nbconvert --to notebook --ExecutePreprocessor.timeout=600 \
    --execute --inplace reports/clustering_metric_data.ipynb
[NbConvertApp] Converting notebook reports/clustering_metric_data.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python3
[NbConvertApp] Writing 506410 bytes to reports/clustering_metric_data.ipynb
jupyter nbconvert --to notebook --ExecutePreprocessor.timeout=600 \
    --execute --inplace reports/clustering_scored_data.ipynb
[NbConvertApp] Converting notebook reports/clustering_scored_data.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python3
[NbConvertApp] Writing 443450 bytes to reports/clustering_scored_data.ipynb
docker run -v /g/scb2/patil/zorrilla/memote/memote-meta-study:/home/rstudio --tty midnighter/knit-memote:3.6.1 \
    Rscript scripts/plot_panel.R
make: docker: No such file or directory
make: *** [Makefile:39: plot] Error 127

Is the docker run -v /g/scb2/patil/zorrilla/memote/memote-meta-study:/home/rstudio --tty midnighter/knit-memote:3.6.1 Rscript scripts/plot_panel.R command pointing to an incorrect or missing file or directory?

Additionally, I suspect there may be an issue with the docker software/installation on my cluster, as I am unable to invoke docker -h without getting the error -bash: docker: command not found. Docker is indeed installed according to pip list, and the PATH variable does seem to include the installation directory of the docker package.
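As a sanity check, the following should distinguish the Docker CLI from the docker Python package installed via pip (the pip package only provides the Python SDK, not the command-line binary):

$ which docker               # locates the Docker CLI binary on PATH, if present
$ python -c "import docker"  # succeeds if the pip-installed Python SDK is importable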

Please let me know if you have any suggestions or require any additional information. Best wishes!

Midnighter commented 4 years ago

Hey,

Can you invoke docker version on the cluster? Our cluster didn't actually allow running Docker (probably yours doesn't either), since it would require too many privileges. I pulled the results from the cluster to my local machine and ran this command there.

franciscozorrilla commented 4 years ago

You are right, I cannot invoke docker version on the cluster. Would it be possible to use a Singularity image instead of a Docker image? It seems like Singularity is supported on the EMBL cluster, at least.

Thanks for the tip! I am now setting up the repo/environment locally and will try to run the Makefile once the setup is complete.

Midnighter commented 4 years ago

You should be able to convert any Docker image into a Singularity one:

https://sylabs.io/guides/3.6/user-guide/singularity_and_docker.html#making-use-of-public-images-from-docker-hub
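For example, something along these lines should work (a sketch only, assuming Singularity 3.x; the bind path is taken from your Makefile output above, and I haven't tested this on your cluster):

singularity pull docker://midnighter/knit-memote:3.6.1
singularity exec --bind /g/scb2/patil/zorrilla/memote/memote-meta-study:/home/rstudio \
    knit-memote_3.6.1.sif Rscript scripts/plot_panel.R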

franciscozorrilla commented 4 years ago

Good to know. I may try running on the cluster using Singularity in the future; for now, it looks like I was able to re-run the analysis locally. It should be possible to modify the clustering_metric_data.ipynb and clustering_scored_data.ipynb notebooks to include memote report summaries for additional/new GEM sets, correct? Thanks again!

Midnighter commented 4 years ago

To be honest, if I were to repeat the analysis today, I would create a nextflow.io pipeline for all of it.

You may not even need to run the clustering notebooks; they just generate PCA, t-SNE, and UMAP embeddings with default settings. As long as your models/results are in the existing directories, or you add your new directories to the relevant scripts, everything should be included.
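In essence, they boil down to something like this minimal Python sketch with default settings (the input file name is a placeholder for whatever the collection scripts produce):

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP  # provided by the umap-learn package

# Hypothetical feature matrix: one row per model, one column per memote metric.
features = pd.read_csv("metric_data.csv", index_col=0)

# Two-dimensional embeddings, all with default settings.
embeddings = {
    "PCA": PCA(n_components=2).fit_transform(features),
    "t-SNE": TSNE(n_components=2).fit_transform(features),
    "UMAP": UMAP(n_components=2).fit_transform(features),
}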

franciscozorrilla commented 4 years ago

I forked the repo and modified all the scripts/reports I could identify to include my dataset. For some reason the generated supplementary materials PDF still does not include my model set, but I was able to obtain the main figure I was going for. Thanks for the help! And in case you are curious, here is how gut microbiome MAG-based GEMs reconstructed using CarveMe compare to the other sets:

[Figure: manuscript panel figure comparing the MAG-based gut GEM set to the other model collections]

Midnighter commented 4 years ago

I'd be happy to help you build the full supplements with your models.

Anyway, that looks very interesting. I wouldn't have thought that your models would be so far away from the general CarveMe models (since there should be quite some overlap in the species), but you might have used a newer version of CarveMe that performs differently with memote in general.

So you also have less genetic evidence for reactions? Does that mean you need to do more gap-filling in order to achieve biomass formation?

franciscozorrilla commented 4 years ago

Thanks for offering your help! In any case, this was a bit of a test run, since I also want to include a larger MAG-based GEM set extracted from TARA Oceans data, plus some smaller MAG-based GEM sets extracted from soil, plant-associated, and lab-culture metagenomics data.

Yes, I also expected at least some overlap between Daniel's EMBL GEMs and my gut GEMs in the t-SNE plot. As you suggest, it could be down to differences in the CarveMe versions used. I could try re-carving the RefSeq genomes with the same CarveMe version I used for my gut GEMs, although I do not know if I will be able to get my hands on the exact version of the database (release 84) that Daniel used.

I also suspect that Daniel did not perform additional gap-filling for the NCBI RefSeq genomes (although I cannot verify this in the paper), whereas I gap-filled with dGMM + LAB media (M3 from this paper, fig 1D). Note that I invoked the gap-filling media within the original carve command, which may yield different results compared with gap-filling after carving, since the former approach allows the gene annotation scores to prioritize the reactions selected for gap-filling based on genetic evidence.
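Concretely, the two approaches look roughly like this (a sketch only; the input files and media database are placeholders, and the flag names are from the CarveMe documentation as I remember them):

carve proteins.faa --gapfill M3 --mediadb media_db.tsv -o model.xml   # gap-fill during carving
gapfill model.xml -m M3 --mediadb media_db.tsv -o model_filled.xml    # gap-fill after carving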

Regarding the fact that there is less genetic evidence for reactions in my MAG-based models, I suspect that this may be due to a combination of the following:

  1. Less-than-perfect-quality MAGs: The threshold for medium-quality (MQ) MAGs is quite lenient, requiring only >50% completeness & <10% contamination (as determined by CheckM), whereas high-quality (HQ) MAGs require >90% completeness and <5% contamination. From memory, roughly 1900/4000 of my gut MAGs met the HQ criteria. I could split the gut GEMs into MQ and HQ sets and see if it is indeed the MQ GEMs that are pulling up the average fraction of reactions without GPR rules (see the sketch after this list).

  2. Poor gene annotation: The genes encoding the enzymes that catalyze those reactions with missing GPR rules could actually be present in the MAG, but may not be "recognized" by CarveMe/the BiGG database; i.e., this could be highlighting some of the weaknesses in our collective gene annotation, particularly for uncultured genomes. Although this is a much more interesting explanation, proving it would likely require some experimental validation.

  3. Fragmented assemblies: The genes could be present in the MAG but split across a number of contigs, so that even genes that would otherwise be well annotated and matched to the BiGG database are too fragmented to be identified. I could probably test this by splitting my MAGs into buckets covering certain ranges of contig counts and seeing whether the more fragmented MAG sets have a higher fraction of reactions without GPR rules (also covered in the sketch below).
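A rough Python sketch of the checks for points 1 and 3 (assuming the CheckM stats, contig counts, and the memote metric of interest have been collected into a hypothetical model_stats.csv):

import pandas as pd

# Hypothetical table: one row per model with CheckM stats, contig counts,
# and the fraction of reactions lacking GPR rules from memote.
df = pd.read_csv("model_stats.csv")

# Point 1: split into HQ vs MQ using the CheckM-based thresholds above.
df["quality"] = "MQ"
df.loc[(df["completeness"] > 90) & (df["contamination"] < 5), "quality"] = "HQ"
print(df.groupby("quality")["frac_reactions_without_gpr"].mean())

# Point 3: bucket models by assembly fragmentation (bin edges are arbitrary).
df["contig_bin"] = pd.cut(df["n_contigs"], bins=[0, 50, 200, 500, 10_000])
print(df.groupby("contig_bin")["frac_reactions_without_gpr"].mean())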

Midnighter commented 4 years ago

Thank you for the detailed information. This is fascinating.

I could try re-carving the RefSeq genomes with the same CarveMe version I used for my gut GEMs, although I do not know if I will be able to get my hands on the exact version of the database (release 84) that Daniel used.

Before doing that: if you look at the average scores of the different scored sections (which you should be able to do with the entire supplementary document), that should give you a quick idea of where your models and the original model collection deviate.

From memory, roughly 1900/4000 of my gut MAGs met the HQ criteria. I could split the gut GEMs into MQ and HQ sets and see if it is indeed the MQ GEMs that are pulling up the average fraction of reactions without GPR rules.

Might be worth a look. Although with almost half of the models being HQ, one might expect an almost bimodal distribution for (c), and that's definitely not the case; it looks like a pretty even distribution. However, if the lower half of that distribution is dominated by HQ models and the upper half by MQ models, that would certainly be a strong argument.

On point 2, are you working with Jaime or BioByte on better automated annotations?

Your point 3 sounds very interesting. If you manage to do this, I'm certainly interested to hear what you find 🙂

franciscozorrilla commented 4 years ago

Before doing that: if you look at the average scores of the different scored sections (which you should be able to do with the entire supplementary document), that should give you a quick idea of where your models and the original model collection deviate.

Sounds good. Do you think you could have a look at my forked repo to see if you can identify at a glance where I still need to make changes to obtain the full supplementary materials PDF? Alternatively, let me know if you have suggestions for how I can pinpoint where the problems are occurring. Here is the pdf file.

On point 2, are you working with Jaime or BioByte on better automated annotations?

I recall briefly talking about this with Kiran and Jaime, but I do not know how much progress has been made. I will discuss this with Kiran when we start up in October.

Your point 3 sounds very interesting. If you manage to do this, I'm certainly interested to hear what you find 🙂

More than happy to share my results!