MicrobialDarkMatter / nanomotif

Nanomotif - a tool for identifying methylated motifs in metagenomic samples
MIT License

MTase-linker: Flag for methylation degree threshold #78

Open Ge0rges opened 3 weeks ago

Ge0rges commented 3 weeks ago

Hello,

Wanted to share the following error obtained when running MTase-linker.

Traceback (most recent call last):
  File "/localdata/researchdrive/gkanaan/tools/ML_dependencies/ML_envs/06b3259e5e81fef4369da217324f5061_/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3790, in get_loc
    return self._engine.get_loc(casted_key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "index.pyx", line 152, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 181, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'n_mod_bin'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/localdata/researchdrive/gkanaan/seaice_methylation/.snakemake/scripts/tmpk_q_90_l.motif_assignment.py", line 43, in <module>
    mean_methylation = nanomotif_table['n_mod_bin'] / (nanomotif_table['n_mod_bin'] + nanomotif_table['n_nomod_bin'])
                       ~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/localdata/researchdrive/gkanaan/tools/ML_dependencies/ML_envs/06b3259e5e81fef4369da217324f5061_/lib/python3.12/site-packages/pandas/core/frame.py", line 3893, in __getitem__
    indexer = self.columns.get_loc(key)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/localdata/researchdrive/gkanaan/tools/ML_dependencies/ML_envs/06b3259e5e81fef4369da217324f5061_/lib/python3.12/site-packages/pandas/core/indexes/base.py", line 3797, in get_loc
    raise KeyError(key) from err
KeyError: 'n_mod_bin'

JSBoejer commented 2 weeks ago

Are you using the bin-motifs.tsv file as input to MTase-linker? The motifs.tsv and motif-scored.tsv files do not work with the pipeline.
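A quick pre-flight check can catch this kind of input mix-up before the pipeline starts. The sketch below is not part of Nanomotif or MTase-linker; the helper name `missing_columns` and the two example headers are made up for illustration. The only thing taken from the traceback is that the script reads the columns `n_mod_bin` and `n_nomod_bin`.

```python
import csv
import io

# Columns read by the failing line in the traceback above.
REQUIRED_COLUMNS = {"n_mod_bin", "n_nomod_bin"}


def missing_columns(tsv_text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from a TSV header, sorted."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])  # empty input yields an empty header
    return sorted(set(required) - set(header))


# Hypothetical headers for illustration only; real files have more columns.
bin_level = "bin\tmotif\tn_mod_bin\tn_nomod_bin\n"
contig_level = "contig\tmotif\tn_mod\tn_nomod\n"

print(missing_columns(bin_level))     # []
print(missing_columns(contig_level))  # ['n_mod_bin', 'n_nomod_bin']
```

Running a check like this on the file passed as the motif input would report the missing columns directly instead of surfacing as a `KeyError` deep inside pandas.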

Ge0rges commented 2 weeks ago

Ah yes, I am using motifs.tsv. I can switch to using bin-motifs.tsv. I am conducting this analysis within the context of a single genome, so I was misled by the bin prefix.

What is the difference between those files? The output section doesn't include that information yet.

JSBoejer commented 2 weeks ago

Also, please make sure to use the newest version of Nanomotif (v. 0.1.15), as it resolves issues present in previous versions.

The difference between motifs.tsv and bin-motifs.tsv lies in a series of post-processing steps that produce a consensus set of motifs across the contigs of each bin (a genome, in your case). Some contigs do not contain a given motif in their sequence, and others may show slight variations of the motif compared to the rest of the bin, due to noise or to the sequence context in which the motif is observed. To account for this, we apply post-processing to find consensus motifs across the whole genome and write them to bin-motifs.tsv.

If you want to find motifs in a single genome, bin-motifs.tsv is your go-to file. motifs.tsv and score-motifs.tsv are more relevant for binning.

For more details on these post-processing steps, refer to supplementary note 1 of our preprint: https://www.biorxiv.org/content/10.1101/2024.04.29.591623v1

Ge0rges commented 2 weeks ago

Got it. Regarding the version, did you mean v0.4.15? That is the version indicated both on the PyPI page and in your conda meta.yaml, and my installation defaulted to it. When I forced pip to install 0.1.15, I got a version of nanomotif with slightly different commands, I believe, including complete-workflow, which I think wasn't present in the version I had installed previously.

JSBoejer commented 2 weeks ago

Yes, sorry about the confusion. I meant v0.4.15.

Ge0rges commented 1 week ago

Getting a different error now on latest version with correct file input.

Select jobs to execute...

[Tue Oct 22 11:35:25 2024]
rule motif_assignment:
    input: nanomotif/brevundimonas_r-contigs/mtase-linker/pfam_hmm_hits/brevundimonas_r-contigs_gene_id_mod_table.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/defensefinder/brevundimonas_r-contigs_processed_defense_finder_mtase.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/blastp/brevundimonas_r-contigs_rebase_mtase_sign_alignment.tsv, nanomotif/brevundimonas_r-contigs/bin-motifs.tsv, /localdata/researchdrive/gkanaan/seaice_methylation/nanomotif/contig_bin.tsv
    output: nanomotif/brevundimonas_r-contigs/mtase-linker/mtase_assignment_table.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/nanomotif_assignment_table.tsv
    jobid: 1
    reason: Missing output files: nanomotif/brevundimonas_r-contigs/mtase-linker/mtase_assignment_table.tsv; Input files updated by another job: nanomotif/brevundimonas_r-contigs/mtase-linker/pfam_hmm_hits/brevundimonas_r-contigs_gene_id_mod_table.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/defensefinder/brevundimonas_r-contigs_processed_defense_finder_mtase.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/blastp/brevundimonas_r-contigs_rebase_mtase_sign_alignment.tsv
    resources: tmpdir=/tmp

Activating conda environment: ../../../../researchdrive/gkanaan/tools/ML_dependencies/ML_envs/71dd0a79701938f24ea6c2c3e756d4dc_
Activating conda environment: ../../../../researchdrive/gkanaan/tools/ML_dependencies/ML_envs/71dd0a79701938f24ea6c2c3e756d4dc_
Traceback (most recent call last):
  File "/localdata/researchdrive/gkanaan/seaice_methylation/.snakemake/scripts/tmpnqxvjl1f.motif_assignment.py", line 103, in <module>
    nanomotif_table_mm50.loc[:,'linked'] = False
    ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  File "/researchdrive/gkanaan/tools/ML_dependencies/ML_envs/71dd0a79701938f24ea6c2c3e756d4dc_/lib/python3.12/site-packages/pandas/core/indexing.py", line 885, in __setitem__
    iloc._setitem_with_indexer(indexer, value, self.name)
  File "/researchdrive/gkanaan/tools/ML_dependencies/ML_envs/71dd0a79701938f24ea6c2c3e756d4dc_/lib/python3.12/site-packages/pandas/core/indexing.py", line 1809, in _setitem_with_indexer
    raise ValueError(
ValueError: cannot set a frame with no defined index and a scalar
[Tue Oct 22 11:35:28 2024]
Error in rule motif_assignment:
    jobid: 1
    input: nanomotif/brevundimonas_r-contigs/mtase-linker/pfam_hmm_hits/brevundimonas_r-contigs_gene_id_mod_table.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/defensefinder/brevundimonas_r-contigs_processed_defense_finder_mtase.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/blastp/brevundimonas_r-contigs_rebase_mtase_sign_alignment.tsv, nanomotif/brevundimonas_r-contigs/bin-motifs.tsv, /localdata/researchdrive/gkanaan/seaice_methylation/nanomotif/contig_bin.tsv
    output: nanomotif/brevundimonas_r-contigs/mtase-linker/mtase_assignment_table.tsv, nanomotif/brevundimonas_r-contigs/mtase-linker/nanomotif_assignment_table.tsv
    conda-env: /researchdrive/gkanaan/tools/ML_dependencies/ML_envs/71dd0a79701938f24ea6c2c3e756d4dc_

RuleException:
CalledProcessError in file /Accounts/gkanaan/miniconda3/nanomotif/lib/python3.9/site-packages/nanomotif/mtase_linker/MTase_linker.smk, line 197:
Command 'source /Accounts/gkanaan/anaconda3/bin/activate '/researchdrive/gkanaan/tools/ML_dependencies/ML_envs/71dd0a79701938f24ea6c2c3e756d4dc_'; set -euo pipefail;  python /localdata/researchdrive/gkanaan/seaice_methylation/.snakemake/scripts/tmpnqxvjl1f.motif_assignment.py' returned non-zero exit status 1.
  File "/Accounts/gkanaan/miniconda3/nanomotif/lib/python3.9/site-packages/nanomotif/mtase_linker/MTase_linker.smk", line 197, in __rule_motif_assignment
  File "/Accounts/gkanaan/miniconda3/nanomotif/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-10-22T113315.257866.snakemake.log
MTase-linker failed with error: Command '['snakemake', '--snakefile', '/Accounts/gkanaan/miniconda3/nanomotif/lib/python3.9/site-packages/nanomotif/mtase_linker/MTase_linker.smk', '--cores', '20', '--config', 'THREADS=20', 'ASSEMBLY=/localdata/researchdrive/gkanaan/seaice_methylation/mags//brevundimonas_r-contigs.fna', 'CONTIG_BIN=/localdata/researchdrive/gkanaan/seaice_methylation/nanomotif/contig_bin.tsv', 'OUTPUTDIRECTORY=nanomotif/brevundimonas_r-contigs/mtase-linker', 'DEPENDENCY_PATH=/researchdrive/gkanaan/tools/ML_dependencies', 'IDENTITY=80', 'QCOVS=80', 'NANOMOTIF=nanomotif/brevundimonas_r-contigs/bin-motifs.tsv', '--use-conda', '--conda-prefix', '/researchdrive/gkanaan/tools/ML_dependencies/ML_envs']' returned non-zero exit status 1.

JSBoejer commented 1 week ago

Can you provide the bin-motifs.tsv you are using?

Ge0rges commented 1 week ago

Here it is:

bin                      mod_type  motif   mod_position  n_mod_bin  n_nomod_bin  motif_type  motif_complement  mod_position_complement  n_mod_complement  n_nomod_complement
brevundimonas_r-contigs  m         GGCGCC  2             130        159          palindrome  GGCGCC            2                        130               159
metagenome_assembly      m         GGCGCC  2             130        159          palindrome  GGCGCC            2                        130               159

JSBoejer commented 1 week ago

The error arises from a filtering step in the motif assignment process. MTase-linker only assigns motifs that are methylated in more than 50% of their occurrences across the entire genome. This is defined by the formula:

n_mod_bin / (n_mod_bin + n_nomod_bin) > 0.5

From the literature (Beaulaurier 2019), we know that when a motif is targeted by an MTase, typically >95% of its occurrences are methylated. This is why we chose a threshold of 50%.

In your case, the two motifs have a methylation level below this threshold. As a result, MTase-linker filters these motifs out and attempts to assign an empty table, leading to the error. Thus, currently MTase-linker does not support the assignment of these motifs. Would you be interested in a configurable flag that could adjust this threshold?
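To make the arithmetic concrete for the table posted above: 130 / (130 + 159) = 130 / 289 ≈ 0.45, which is below 0.5, so both rows are dropped. The sketch below reproduces that filter in plain Python with an adjustable cutoff. It is only an illustration, not MTase-linker's code: the function names and the `min_fraction` parameter are hypothetical, and no such flag exists in the tool at the time of this thread.

```python
def methylation_fraction(n_mod, n_nomod):
    """Fraction of motif occurrences called methylated (0.0 if none observed)."""
    total = n_mod + n_nomod
    return n_mod / total if total else 0.0


def filter_motifs(rows, min_fraction=0.5):
    """Keep rows whose methylation fraction is strictly above min_fraction."""
    return [
        row for row in rows
        if methylation_fraction(row["n_mod_bin"], row["n_nomod_bin"]) > min_fraction
    ]


# The two rows from the bin-motifs.tsv posted in this thread.
rows = [
    {"bin": "brevundimonas_r-contigs", "motif": "GGCGCC", "n_mod_bin": 130, "n_nomod_bin": 159},
    {"bin": "metagenome_assembly", "motif": "GGCGCC", "n_mod_bin": 130, "n_nomod_bin": 159},
]

print(round(methylation_fraction(130, 159), 2))  # 0.45

kept = filter_motifs(rows)  # default cutoff of 0.5
if not kept:
    # Guarding the empty case gives a readable message instead of a crash.
    print("No motifs pass the methylation threshold; nothing to assign.")

print(len(filter_motifs(rows, min_fraction=0.4)))  # 2
```

An explicit check for an empty result after filtering, as in the `if not kept` branch, is one way a pipeline step could exit cleanly when no motif clears the cutoff.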

It would be interesting to filter the modkit pileup for methylations related to the motif, and then make a similar plot to the ones in figure S8 of the Nanomotif article. I guess you would see something like the middle plot for figure S8.

You might also consider adjusting the --threshold_methylation_general option, which determines whether a position is considered methylated or not.

For further details, you might find this previous discussion helpful: link to issue #60.

Ge0rges commented 1 week ago

Hi @JSBoejer, that would be a good flag to have. Generating something like S8 would indeed be interesting! Thanks.