Closed jbloom closed 8 months ago
@jbloom I've attempted this analysis seeking advice from @caelanradford and @bblarsen-sci and was not able to get this working. I am pasting the error below (I think in 'build_codon_variants'), and the updated files follow this message. The repo is up-to-date (version ceb5ee6).
Here is the PacBio_amplicon.gb. Here is the PacBio_feature_parse_specs.yaml.
Both were updated as instructed above using formatting from Brendan's repo.
Traceback (most recent call last):
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/bin/papermill", line 11, in <module>
sys.exit(papermill())
^^^^^^^^^^^
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
return f(get_current_context(), *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/papermill/cli.py", line 250, in papermill
execute_notebook(
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/papermill/execute.py", line 128, in execute_notebook
raise_for_execution_errors(nb, output_path)
File "/fh/fast/bloom_j/computational_notebooks/aaditham/2024/RABV_Pasteur_G_DMS/.snakemake/conda/f1180436d2da8354655661ed54d7d6e1_/lib/python3.11/site-packages/papermill/execute.py", line 232, in raise_for_execution_errors
raise error
papermill.exceptions.PapermillExecutionError:
---------------------------------------------------------------------------
Exception encountered at "In [10]":
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
Cell In[10], line 23
21 # read wildtype protein
22 refprot = str(Bio.SeqIO.read(config["gene_sequence_protein"], "fasta").seq)
---> 23 assert len(refprot) >= site_numbering_map["sequential_site"].max()
24 assert site_numbering_map["sequential_site"].min() >= 1
26 # add reference site numbering and remove wildtype to wildtype mutations
AssertionError:
@arjunaditham, I think you have to update your site numbering map file.
I think that file needs to have the sequential_site
column number 1, 2, ... just the sites that are mutated in the protein. Right now it includes sites that are before and after the mutated region.
Does this make sense?
I have updated site_numbering_map.csv accordingly and have queued this job to run on slurm.
I had tried what you described before and failed, but, upon discussion with @bblarsen-sci, it turns out that my 'sequential_site' column should start with 1 as it seems to refers to the site number within the mutagenized region only. Generally, this also corresponds with 'reference_site' but not in my case since I am not doing DMS for the whole glycoprotein.
If this works, I will also edit to remove the transmembrane and cytoplasmic tail and edit all documents accordingly.
Updated to constrain analysis to ectodomain. Removed signal peptide, transmembrane domain, and cytoplasmic domain.
Re-numbered sites to comport with conventional numbering scheme in structures and prior mutagenesis studies.
@arjunaditham, your plots and stats will be better if you set up the analysis to only look at and plot mutations in the ectodomain region you mutated, and not the signal peptide.
You can sort of see how to do this by looking at @bblarsen-sci's Nipah RBD repo.
Basically, I believe what you do is set the PacBio amplicon to classify the coding sequence 5' and 3' of the mutated ectodomain to be named as different features, as in the
gene_flank5
feature here in @bblarsen-sci's repo.Then you set the PacBio feature parsing to not allow any mutations in those regions, as here.
This will make your plots just show the mutated ectodomain region, which is the only region for which we can get good mutation estimates.
@bblarsen-sci, chime in if there is anything else to doing this that I forgot.