configure to analyze just mutations in the mutated ectodomain region

jbloom commented 8 months ago

@arjunaditham, your plots and stats will be better if you set up the analysis to only look at and plot mutations in the ectodomain region you mutated, and not the signal peptide.

You can sort of see how to do this by looking at @bblarsen-sci's Nipah RBD repo.

Basically, I believe what you do is set the PacBio amplicon to classify the coding sequence 5' and 3' of the mutated ectodomain to be named as different features, as in the gene_flank5 feature here in @bblarsen-sci's repo.

Then you set the PacBio feature parsing to not allow any mutations in those regions, as here.

This will make your plots just show the mutated ectodomain region, which is the only region for which we can get good mutation estimates.

@bblarsen-sci, chime in if there is anything else to doing this that I forgot.

arjunaditham commented 8 months ago

@jbloom I've attempted this analysis seeking advice from @caelanradford and @bblarsen-sci and was not able to get this working. I am pasting the error below (I think in 'build_codon_variants'), and the updated files follow this message. The repo is up-to-date (version ceb5ee6).

Here is the PacBio_amplicon.gb. Here is the PacBio_feature_parse_specs.yaml.

Both were updated as instructed above using formatting from Brendan's repo.

Traceback (most recent call last):
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/bin/papermill", line 11, in <module>
    sys.exit(papermill())
             ^^^^^^^^^^^
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/papermill/cli.py", line 250, in papermill
    execute_notebook(
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/papermill/execute.py", line 128, in execute_notebook
    raise_for_execution_errors(nb, output_path)
  File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/papermill/execute.py", line 232, in raise_for_execution_errors
    raise error
papermill.exceptions.PapermillExecutionError: 
---------------------------------------------------------------------------
Exception encountered at "In [10]":
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[10], line 23
     21 # read wildtype protein
     22 refprot = str(Bio.SeqIO.read(config["gene_sequence_protein"], "fasta").seq)
---> 23 assert len(refprot) >= site_numbering_map["sequential_site"].max()
     24 assert site_numbering_map["sequential_site"].min() >= 1
     26 # add reference site numbering and remove wildtype to wildtype mutations

AssertionError:

jbloom commented 8 months ago

@arjunaditham, I think you have to update your site numbering map file.

I think that file needs to have the sequential_site column number 1, 2, ... just the sites that are mutated in the protein. Right now it includes sites that are before and after the mutated region.

Does this make sense?

arjunaditham commented 8 months ago

I have updated site_numbering_map.csv accordingly and have queued this job to run on slurm.

I had tried what you described before and failed, but, upon discussion with @bblarsen-sci, it turns out that my 'sequential_site' column should start with 1 as it seems to refers to the site number within the mutagenized region only. Generally, this also corresponds with 'reference_site' but not in my case since I am not doing DMS for the whole glycoprotein.

If this works, I will also edit to remove the transmembrane and cytoplasmic tail and edit all documents accordingly.

arjunaditham commented 8 months ago

Updated to constrain analysis to ectodomain. Removed signal peptide, transmembrane domain, and cytoplasmic domain.

Re-numbered sites to comport with conventional numbering scheme in structures and prior mutagenesis studies.

dms-vep / RABV_Pasteur_G_DMS

configure to analyze just mutations in the mutated ectodomain region #1