franciscozorrilla / metaGEM

:gem: An easy-to-use workflow for generating context specific genome-scale metabolic models and predicting metabolic interactions within microbial communities directly from metagenomic data
https://franciscozorrilla.github.io/metaGEM/
MIT License

[Question]: Feature name data type issues and workflow failures when processing "crossMapSeries" output files in the "binRefine" workflow #168

Closed YuXia0 closed 1 week ago

YuXia0 commented 3 weeks ago

Attachment: coverage_table.zip

Dr. Francisco Zorrilla,

Hello!

When using metaGEM for analysis, I encountered an issue involving feature name data type errors when the output files generated by "crossMapSeries" were processed in the "binRefine" workflow (Rule context).

The commands I executed are as follows:

  1. "bash metaGEM.sh -t megahit -j 2 -c 24 -m 120 -h 24 -l"

  2. "bash metaGEM.sh -t crossMapSeries -j 2 -c 24 -m 120 -h 24 -l"

  3. "bash metaGEM.sh -t binRefine -j 2 -c 24 -m 150 -h 24 -l"

When executing the binRefine workflow, the following error message appeared:

"Feature names are only supported if all input features have string names, but your input has ['int', 'str'] as feature name / column name types. If you want feature names to be stored and validated, you must convert them all to strings, by using X.columns = X.columns.astype(str) for example. Otherwise you can remove feature / column names from your input data, or convert them all to a non-string data type."

Although the workflow continued briefly after the error, it failed shortly afterwards. All commands were run on the local machine using the --local option.

Based on the error message, I understand that this is due to different data types (e.g., int and str) for the feature names (column names) in the input files. However, since these input files are generated by crossMapSeries, I am not sure how to best handle this issue without affecting the workflow. (Attached is the TSV file generated by crossMapSeries, located in output/concoct/sample_ID/cov)
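To make this diagnosis concrete, here is a minimal Python sketch of the situation the error message describes: a set of column labels that mixes `int` and `str`. The helper names `label_types` and `stringify_labels` and the sample labels are hypothetical, chosen only for illustration; they are not part of metaGEM, CONCOCT, or scikit-learn, and the labels are not taken from the attached file.

```python
# Illustrative only: hypothetical column labels, not taken from the attached file.

def label_types(labels):
    """Return the sorted names of the types found among the column labels."""
    return sorted({type(label).__name__ for label in labels})


def stringify_labels(labels):
    """Convert every column label to a string, leaving the order unchanged."""
    return [str(label) for label in labels]


mixed = ["contig_length", 0, 1, 2]   # str and int labels side by side
print(label_types(mixed))            # ['int', 'str']

fixed = stringify_labels(mixed)
print(label_types(fixed))            # ['str']

# For a pandas DataFrame X, the error message itself suggests the equivalent:
#     X.columns = X.columns.astype(str)
```

A TSV header on disk is plain text, so a mix like this may only appear once the table has been loaded and combined in memory. That is an assumption about where the types diverge, not something confirmed here.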

Therefore, I would like to ask you the following questions:

  1. For output files generated by crossMapSeries, how do you recommend handling the feature name data type issue before entering the binRefine workflow? Are there any recommended best practices?

  2. If the column name data type needs to be converted, are there any specific implementation suggestions or tools that can help accomplish this task?

  3. Is this error likely related to the failure of the workflow? If so, what preventive or remedial measures can be recommended?

Thank you very much for your help and guidance! I hope to hear from you soon.

Best wishes,

franciscozorrilla commented 3 weeks ago

Hey @YuXia0,

  1. For output files generated by crossMapSeries, how do you recommend handling the feature name data type issue before entering the binRefine workflow? Are there any recommended best practices?

  2. If the column name data type needs to be converted, are there any specific implementation suggestions or tools that can help accomplish this task?

I would first check what version of snakemake you are running, and make sure it's >=5.10.0, <5.31.1. You should not need to manually modify files between rules if everything is working correctly.
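As a rough sketch of that version check, the snippet below compares an installed snakemake version against the bounds quoted above. The function names `parse_version` and `in_supported_range` are hypothetical, and the parser assumes a plain `major.minor.patch` version string; it is meant to be run inside the environment you want to inspect.

```python
from importlib import metadata

# Bounds quoted in the comment above: >=5.10.0, <5.31.1
MIN_VERSION = (5, 10, 0)
MAX_VERSION_EXCLUSIVE = (5, 31, 1)


def parse_version(text):
    """Turn a plain 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def in_supported_range(text):
    """True if the version satisfies >=5.10.0 and <5.31.1."""
    return MIN_VERSION <= parse_version(text) < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    try:
        installed = metadata.version("snakemake")
    except metadata.PackageNotFoundError:
        print("snakemake is not installed in this environment")
    else:
        print(installed, "supported:", in_supported_range(installed))
```

Comparing tuples of integers rather than raw strings matters here: as text, "5.9.9" sorts after "5.10.0", which would give the wrong answer.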

  3. Is this error likely related to the failure of the workflow? If so, what preventive or remedial measures can be recommended?

It looks like the error is coming from the metawrap environment, and potentially related to installation/version issues. It has also been raised before by users running metagem with the --local flag (https://github.com/franciscozorrilla/metaGEM/issues/138#issuecomment-1699323172). Please have a look at these issues/comments to check if the solution also works for you (https://github.com/BinPro/CONCOCT/issues/322 , https://github.com/BinPro/CONCOCT/issues/321#issuecomment-1373775878).

Let me know if this helps and if you have any other issues.

Best, Francisco

YuXia0 commented 2 weeks ago

@franciscozorrilla I checked my 'snakemake' version. I ran the workflow in the 'metagem' environment created from the 'metaGEM_env.yml' file, and the earlier workflow steps showed no failures, at least in the output logs. When processing the output file generated by 'crossMapSeries', I tried converting the column names to 'str', both as a whole list and individually, but it still failed. I am not sure how to proceed and would appreciate your help.

franciscozorrilla commented 2 weeks ago

Hey @YuXia0, I am happy to help, did you try checking these related issues? I believe that your solution can be found there.

It looks like the error is coming from the metawrap environment, and potentially related to installation/version issues. It has also been raised before by users running metagem with the --local flag (https://github.com/franciscozorrilla/metaGEM/issues/138#issuecomment-1699323172). Please have a look at these issues/comments to check if the solution also works for you (https://github.com/BinPro/CONCOCT/issues/322 , https://github.com/BinPro/CONCOCT/issues/321#issuecomment-1373775878).

As I mentioned, the error appears to come from the metawrap environment, not the metagem environment. As you will see from the documentation, metaGEM sets up two separate environments since metaWRAP is not compatible with Python 3. Also, you should not have to manually parse intermediate files; that is the point of using a workflow manager :)
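One quick way to see which of the two environments a command is actually running in is a small check like the sketch below. The function name `describe_environment` is hypothetical, and it assumes the environments are managed by conda, which sets the `CONDA_DEFAULT_ENV` variable on activation. It avoids f-strings so that it can also run under Python 2.

```python
import os
import sys


def describe_environment(environ=None):
    """Return the active conda environment name and the Python version."""
    if environ is None:
        environ = os.environ
    # CONDA_DEFAULT_ENV is set by `conda activate`; it is absent otherwise.
    env_name = environ.get("CONDA_DEFAULT_ENV", "none active")
    version = "{}.{}".format(sys.version_info[0], sys.version_info[1])
    return "conda env: {}; python: {}".format(env_name, version)


print(describe_environment())
```

Running it once in each environment should show a different environment name and a different Python major version in the two cases.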

Could you please post a screenshot of the crossMapSeries output? I cannot open the file. Also, do you have properly generated bins yet? I am trying to understand if the binning rules are failing (these run with the metagem environment) or if the bin refinement/reassembly rules are failing (these run in metawrap env).

Best, Francisco

YuXia0 commented 1 week ago

@franciscozorrilla Thank you very much for your answer. I have solved all the problems.