AuReMe / metage2metabo

From annotated genomes to metabolic screening in large scale microbiotas
https://metage2metabo.readthedocs.io
GNU Lesser General Public License v3.0

m2m_analysis powergraph not working as expected #41

Closed alsmadin01 closed 1 year ago

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

I have been having some issues running m2m_analysis powergraph. Using genome-scale metabolic models for almost 3300 gut microbes, I am trying to visualize the minimal communities required to produce the output metabolites from addedvalue. All steps in the m2m pipeline worked except "m2m_analysis powergraph", for which I get the error attached below. Previously, this step worked for a smaller microbial sample, but I had issues with obtaining the full image (only a section of the powergraph image was observed in the output).

Do you have any recommendations on how I could rectify that?

Thank you!

Best,

Noor Alsmadi

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

I think the error associated with your screenshots is linked to the mpwt-TEST.tsv file. If I am understanding the error correctly, m2m powergraph tries to search for a lineage using the taxonomic name of the species (here faecis) instead of the corresponding taxonomic ID.

For this part of the workflow, the reading of the taxon file is less flexible than in other parts of the workflow, such as the reconstruction with mpwt. This needs improvement, but at the moment we cannot work on it.

So this part of the workflow needs the following structure for the mpwt-TEST.tsv file:

Column 1 (taxonomic name)    Column 2 (taxonomic ID)
Blautia faecis               871665
Escherichia coli             562

m2m powergraph will then use Column 2 (taxonomic ID) to get the lineage. If column 2 contains other information (such as the taxonomic name), it will produce this error. This is currently not documented; I am sorry about that.
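For anyone who wants to check a taxon file before running m2m_analysis powergraph, here is a minimal sketch that flags rows whose second column is not a plain integer. The function name `find_non_numeric_taxids` is hypothetical and is not part of m2m or mpwt; a header row, if present, would also be reported.

```python
import csv
import io


def find_non_numeric_taxids(tsv_text, taxid_column=1):
    """Return (line_number, value) for rows whose taxid column is not an integer.

    taxid_column is zero-based, so the default of 1 means "second column".
    """
    problems = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        value = row[taxid_column].strip() if len(row) > taxid_column else ""
        if not value.isdigit():
            problems.append((line_number, value))
    return problems


# Second row repeats the name where a taxonomic ID is expected.
example = "Blautia faecis\t871665\nEscherichia coli\tEscherichia coli\n"
print(find_non_numeric_taxids(example))
```

To run it on a real file, pass the file content, for example `find_non_numeric_taxids(open("mpwt-TEST.tsv").read())`.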

> Previously, this step worked for a smaller microbial sample, but I had issues with obtaining the full image (only a section of the powergraph image was observed in the output).

I am intrigued by this. Could you elaborate? Was the powergraph in the image incomplete? Was it with the svg file or the html representation of the powergraph?

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

Thank you for your response.

As recommended, I did switch out the columns in the mpwt-TEST.tsv file, but I am still getting an error (attached below).

Regarding the powergraph image, I can only see part of the full image for the svg file (attached). The html file seemed to be okay (attached).

Screen Shot 2022-09-16 at 11 59 03 AM Screen Shot 2022-09-16 at 11 59 17 AM Screen Shot 2022-09-16 at 12 03 31 PM Screen Shot 2022-09-16 at 12 04 56 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

The issue with mpwt-TEST.tsv is strange. Can you send me the file or an example of the file?

The issue with the svg file for the powergraph is that, by default, the picture is zoomed in compared to its real size. The preview made by the image viewer therefore shows only part of the powergraph. The powergraph created is complete but is displayed truncated. To see all of it, I recommend using an svg editor such as Inkscape. With this software, you can load the svg file, zoom out and look at the complete powergraph (which should be similar to the one in the html file). It is then possible to edit the file and export it in another format (such as png).
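As a quick diagnostic before opening an editor, one can look at the canvas size the svg declares. This is a minimal sketch using only the Python standard library; `svg_canvas_info` is a hypothetical helper and the sample svg is made up for illustration, it is not actual m2m_analysis output.

```python
import xml.etree.ElementTree as ET


def svg_canvas_info(svg_text):
    """Return the width, height and viewBox attributes of the root <svg> element."""
    root = ET.fromstring(svg_text)
    return {name: root.get(name) for name in ("width", "height", "viewBox")}


# Made-up sample: a large canvas with no viewBox declared.
sample = '<svg xmlns="http://www.w3.org/2000/svg" width="12000" height="9000"></svg>'
print(svg_canvas_info(sample))
```

For a file on disk, `ET.parse(path).getroot()` gives the same root element. A very large width and height would be consistent with a viewer showing only a corner of the picture.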

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

Thank you for your response! I will try and do as recommended to get the full powergraph image in svg. Please also find attached a screenshot of the mpwt-TEST.tsv file contents.

Thank you for your help!

Best regards,

Noor Alsmadi

Screen Shot 2022-09-16 at 1 01 47 PM

ArnaudBelcour commented 2 years ago

This is strange because there is no taxon ID in this file. Did your previous file have a column containing numbers, such as:

1485 Clostridium MGY000000001
871665 Blautia faecis MGY000000002

I ask because the column needed by metage2metabo is, in this example, the first one (with the number corresponding to the taxonomic ID). If that is the case, you need to swap the first column with the second:

Clostridium 1485 MGY000000001
Blautia faecis 871665 MGY000000002

If you do not have a taxon ID in this file, that is strange, because mpwt requires it to run (unless the taxonomic ID is in the genbank file). Where does this file come from?

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

The tsv file I used to create the gbk files using emapper2gbk was formatted as follows:

Column 1: Taxon ID (e.g. MGYG000000001, MGYG000000002).
Column 2: Taxonomic name

I then ran metabolic reconstruction with pathway tools, and the rest of the M2M pipeline without any issues.

I am not sure what the first column here represents:

1485 Clostridium MGY000000001
871665 Blautia faecis MGY000000002

Could you please elaborate? I don't think I fully understand your concern in the previous message.

Best regards,

Noor Alsmadi

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

OK, I understand it better now. Sorry, there was a misunderstanding about which file was used.

I thought that you were using a file following the mpwt taxon_id.tsv structure (which is needed for this part of the workflow). But you are using the emapper2gbk file.

First, to avoid any confusion over terminology: when I talk about a taxonomic ID, I mean the ID in the NCBI taxonomy database. This database has an ID for each species, genus, family, etc. For example, the ID for Escherichia coli is 562 (you can see it on its page in that database, after "Taxonomy ID").

What you refer to as the Taxonomic ID (MGYG000000001) is rather the ID of the organism associated with the data. So we were not talking about the same thing.

So the emapper2gbk file contains:
Column 1: Organism ID (e.g. MGYG000000001, MGYG000000002).
Column 2: Taxonomic name

But m2m powergraph requires the taxon_id file, such as:
Column 1: Organism ID (e.g. MGYG000000001, MGYG000000002).
Column 2: Taxonomic ID

It is possible to create this file if you have the folder that you give as input to mpwt or m2m recon to reconstruct metabolic networks. If this is the case, you can use the following command:

mpwt topf -f input_folder -o output_folder --cpu cpu_number

input_folder corresponds to the folder containing the genbank files for mpwt/m2m recon.

Inside the output_folder, you will have one folder for each genbank file, which you can delete. But you will also have a file named taxon_id.tsv. This is the file that m2m powergraph requires for the --taxon option.

Do you still have the mpwt/m2m recon folder? If not, there is another way to create this file, but it will require a python script for the conversion.
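The conversion script is not shown in this thread. As a rough idea of what it could look like, here is a minimal sketch that turns (organism ID, taxonomic name) pairs into (organism ID, taxonomic ID) rows. Everything in it is an assumption for illustration: the function names are hypothetical, the name-to-taxid mapping has to be supplied by the user (ete3's `NCBITaxa().get_name_translator(names)` is one possible source), and the header `species` / `taxon_id` should be checked against a taxon_id.tsv produced by `mpwt topf`.

```python
import csv
import io


def build_taxon_id_rows(organism_to_name, name_to_taxid):
    """Pair each organism ID with the taxonomic ID of its name.

    Returns (rows, missing): rows are (organism_id, taxid) tuples, missing
    lists the (organism_id, name) pairs with no known taxonomic ID.
    """
    rows, missing = [], []
    for organism_id, name in organism_to_name:
        taxid = name_to_taxid.get(name)
        if taxid is None:
            missing.append((organism_id, name))
        else:
            rows.append((organism_id, str(taxid)))
    return rows, missing


def to_tsv(rows, header=("species", "taxon_id")):
    """Serialize rows as tab-separated text (header names are an assumption)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Names and taxonomic IDs taken from the examples earlier in this thread.
pairs = [("MGYG000000001", "Clostridium"), ("MGYG000000002", "Blautia faecis")]
known = {"Clostridium": 1485, "Blautia faecis": 871665}
rows, missing = build_taxon_id_rows(pairs, known)
print(to_tsv(rows))
```

Names left in `missing` need to be resolved by hand, since a taxonomic name may be absent from the database or spelled differently.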

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

Okay – now I understand what is going on. Sorry for the confusion.

I did as requested and got a "taxon_id.tsv" file (below), but I still get an error when running m2m_analysis powergraph (see attached).

Screen Shot 2022-09-16 at 3 29 27 PM Screen Shot 2022-09-16 at 3 29 48 PM Screen Shot 2022-09-16 at 3 30 00 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

This is strange. I have tested the taxid 1622 with my version of the ete3 NCBI taxonomy and it works. Maybe the taxonomic database of your package is not up-to-date. You can check this by opening a python terminal and running the following lines:

from ete3 import is_taxadb_up_to_date

print(is_taxadb_up_to_date())

If this returns False, then the NCBI database of your package is not up-to-date. To fix this, you can update it by using the following lines:

from ete3 import NCBITaxa

ncbi = NCBITaxa()
ncbi.update_taxonomy_database()

This will download the latest version of the NCBI taxonomy database and it could solve this issue.

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

It seems like my database was already up to date given that the command returned as "True".

Best regards,

Noor Alsmadi

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

OK, this is really strange. But I have looked at the code behind is_taxadb_up_to_date() and I am not sure that it really checks what could be the source of your issue.

So could you try to update the taxonomy database (using the python lines in my previous post), just to be sure?

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

I just tried to update it, but I am running into this error:

Screen Shot 2022-09-19 at 2 20 31 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

It seems that the downloaded version of the database and the version installed have conflicts for the taxids. Can you try to delete the folder /nfs/homes/alsmadin/.etetoolkit?

This will delete the installed version of the database and force the download and install of the new database.

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

I am still getting an error, this time when running ncbi = NCBITaxa().

Screen Shot 2022-09-19 at 3 18 55 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

When I compare the numbers reported by my version and your version, they are not equal:

Updating taxdump.tar.gz from NCBI FTP site (via HTTP)...
Loading node names...
2444356 names loaded.
283271 synonyms loaded.
Loading nodes...
2444356 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /root/.etetoolkit/taxa.sqlite ...
 2444000 generating entries... 
Uploading to /root/.etetoolkit/taxa.sqlite

Inserting synonyms:      280000 
Inserting taxid merges:  65000 
Inserting taxids:       2440000 

So there is a possible issue with the taxonomy database used.

First, which version of ete3 do you use (the current one is 3.1.2)? If it is an older version, could you update it?

Second, do you have a file named taxdump.tar.gz in the folder where you launched the python terminal? If yes, could you try to delete it, delete the /nfs/homes/alsmadin/.etetoolkit folder and rerun the python lines? This file is the taxonomy database stored in a compressed format; it is used to update the sqlite database associated with ete3, which is contained in /nfs/homes/alsmadin/.etetoolkit.
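To see which of these two leftovers are present before deleting anything, a small standard-library sketch could be used. The helper name `find_ete3_leftovers` is hypothetical; the paths follow the ones mentioned above (a `.etetoolkit` folder in the home directory and a `taxdump.tar.gz` in the directory where python was launched). It only lists the paths and removes nothing.

```python
from pathlib import Path


def find_ete3_leftovers(home, working_dir):
    """Return the existing paths among ~/.etetoolkit and ./taxdump.tar.gz."""
    candidates = [
        Path(home) / ".etetoolkit",           # installed sqlite database
        Path(working_dir) / "taxdump.tar.gz",  # downloaded compressed dump
    ]
    return [path for path in candidates if path.exists()]


for path in find_ete3_leftovers(Path.home(), Path.cwd()):
    print("present:", path)
```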

Best regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

Thank you for your response. I succeeded in downloading the updated NCBI database after updating ete3 from version 3.1.1 to 3.1.2 as requested. I then ran m2m_analysis powergraph but still ran into an error (unfortunately):

Screen Shot 2022-09-20 at 7 33 15 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 1 year ago

Hi @alsmadin01,

It's good that the issue with ete3 has been resolved. I am sorry, I should have thought of a simple version issue sooner.

The new error is one that I have encountered with clyngor, a dependency of m2m (https://github.com/Aluriak/PowerGrASP/issues/1). I have created a new release, m2m 1.5.1, that should fix it. Can you try it and tell me if it fixes this issue?

Best regards, Arnaud Belcour.

alsmadin01 commented 1 year ago

Hi @ArnaudBelcour,

Thank you for the recommendations. I installed m2m 1.5.1 and m2m test worked. However, I ran into an error when running m2m analysis using the test data provided:

Screen Shot 2022-09-21 at 11 46 27 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 1 year ago

Hi @alsmadin01,

I have checked and it seems to be an issue with the version of the bubbletools package. The version you have seems to be older than 0.6.7; after that version, the function bubble_to_js() has the argument width_as_cover.

Can you try pip install bubbletools --upgrade and see if it solves this issue?
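Since two of the problems in this thread came down to package versions, a quick check of the installed versions can save time. This is a minimal sketch using `importlib.metadata` from the standard library (Python 3.8+). The helper names are hypothetical, the version parsing is deliberately naive (it only looks at the digits), and the minimum versions are the ones read from this thread: ete3 3.1.2 and bubbletools 0.6.7.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Naive parse: '3.1.2' -> (3, 1, 2). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def is_older_than(installed, minimum):
    """True if the installed version sorts before the minimum version."""
    return version_tuple(installed) < version_tuple(minimum)


def report(package, minimum):
    """Return a one-line status for an installed package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package}: not installed"
    status = "too old" if is_older_than(installed, minimum) else "ok"
    return f"{package} {installed} (minimum {minimum}): {status}"


print(report("ete3", "3.1.2"))
print(report("bubbletools", "0.6.7"))
```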

Best regards, Arnaud Belcour.

alsmadin01 commented 1 year ago

Hi @ArnaudBelcour,

I updated bubbletools, ran m2m_analysis using the test data, and it worked. However, when I run the analysis using my community of 3315 microbes, I get the following issue:

Screen Shot 2022-09-22 at 9 51 23 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 1 year ago

Hi @alsmadin01,

I have identified your issue: m2m_analysis selects a set of colors to apply to the powergraph, and in your case there were not enough colors for all the taxa.

I have made a new release (1.5.2) that should fix this issue. Furthermore, I have added different shapes for the nodes in the html output (circles for essential symbionts and rectangles for alternative symbionts). This could make the results easier to understand.
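The actual fix is in the m2m releases and is not reproduced here. Purely as an illustration of the failure mode, indexing a fixed palette fails once there are more taxa than colors, and one way to avoid that is to generate as many colors as there are taxa. The sketch below does this with the standard `colorsys` module; `assign_taxon_colors` is a hypothetical name and this is not the code used in m2m.

```python
import colorsys


def assign_taxon_colors(taxa):
    """Map each distinct taxon to a hex color by spacing hues evenly.

    Never runs out of colors, although with many hundreds of taxa
    neighbouring colors become hard (or impossible) to tell apart.
    """
    unique_taxa = sorted(set(taxa))
    count = max(len(unique_taxa), 1)
    colors = {}
    for index, taxon in enumerate(unique_taxa):
        red, green, blue = colorsys.hsv_to_rgb(index / count, 0.65, 0.9)
        colors[taxon] = "#{:02x}{:02x}{:02x}".format(
            round(red * 255), round(green * 255), round(blue * 255)
        )
    return colors


palette = assign_taxon_colors(["Clostridium", "Blautia", "Escherichia", "Blautia"])
for taxon, color in palette.items():
    print(taxon, color)
```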

Best regards, Arnaud Belcour.

alsmadin01 commented 1 year ago

Hi @ArnaudBelcour,

Thank you for your response. I installed m2m 1.5.2, but I keep getting the same error.

Best regards,

Noor Alsmadi

ArnaudBelcour commented 1 year ago

Hi @alsmadin01,

Can you post a screenshot of the error? I have modified the code, so even if the error is similar, there should be small differences (such as the line of code at which the error occurs) that could help me identify it.

For example, if you encounter the error in the same part of the code (taxon_colors[taxon] = used_colors[index]), it is on line 144 in version 1.5.1 (as shown in your error message), and it should be on line 176 in version 1.5.2.

Best regards, Arnaud Belcour.

alsmadin01 commented 1 year ago

Hi @ArnaudBelcour,

Please find below a screenshot of the error:

Screen Shot 2022-09-26 at 12 37 07 PM

Best regards,

Noor Alsmadi

ArnaudBelcour commented 1 year ago

Hi @alsmadin01,

I have checked and found the issue with the modification I made in 1.5.2. I have made release 1.5.3, which should really fix this issue this time. Sorry for the incorrect fix.

Best regards, Arnaud Belcour.

alsmadin01 commented 1 year ago

Hi @ArnaudBelcour,

It works perfectly! Thank you so much for your help and patience. I really appreciate it.

Best regards,

Noor Alsmadi

ArnaudBelcour commented 1 year ago

Hi @alsmadin01,

I am glad that we managed to fix all the issues you encountered with m2m_analysis. So I will close this issue.

Best regards, Arnaud Belcour.