Nextera prep + random primer

samcho commented 3 years ago

Hello,

I used a Nextera prep with a random 9-mer primer. From the original Shapemapper paper in 2015, it says that I should turn off the randomlyPrimed flag, but the command no longer exists in Shapemapper2. What is the appropriate way for me to use Shapemapper2? If I do not use the random-primer-len parameter, I get the following error message: "RuntimeError: Error: input data lengths do not all match."

Also, I had an issue merging the R1 and R2 fastq files because Shapemapper says that the "R1" and "R2" flags are not recognized arguments. I also do not see them in the list of parameters on the main page. Am I doing something wrong?

shapemapper commented 3 years ago

Hi Sam, shapemapper2 defaults to not performing any primer trimming, so if you are using Nextera, there are no additional arguments required. As for the --R1/R2 unrecognized argument issue, I'll admit the documentation could be a little clearer. With shapemapper2, you need to provide --modified before the treated inputs, and --untreated before the control inputs. If you still have issues getting it to run, please paste the exact command you tried, and someone should be able to help you format things correctly. You might also find the usage examples section of the Readme helpful: https://github.com/Weeks-UNC/shapemapper2#usage-examples

samcho commented 3 years ago

Hello Steven,

Thank you for that information and your quick response. It helped me better understand how to use your program correctly.

However, I am having another issue running Shapemapper2. My collaborator had previously run the original version of Shapemapper, which it did not count the complex deletion, insertion, and complex insertion and only counted the misincorporation and unambiguous deletion, to determine post-transcriptional modification of RNA molecules. Shapemapper1 determined these modifications correctly, while Shapemapper2 is predicting modifications where there are no modifications in the RNA fragment.

Is there a way to obtain only the mutation rates for the misincorporations and unambiguous deletions like in the original version of Shapemapper in Shapemapper2?

Psirving commented 3 years ago

There's not a super easy way to do this. What you can do is use --output-counted-mutations, which will make output files in which each column is a count of each mutation type, and each row is a nucleotide position. You can recalculate mutation rates from this table by summing only the mutation types you are interested in and dividing by the new effective read depth. This is how I have done it in the past for detecting CMC-adducts at pseudouridine sites.

samcho commented 3 years ago

Dear Patrick,

Thank you for your suggestion. I am still not getting the same values that was in Shapemapper1. Can you tell me exactly which combinations of mutations, deletions, insertions, etc. worked best for you in determining pseudouridines with Shapemapper2? Also, can you clarify what you meant by "new effective read depth"? Do you mean the read depth that is in the output "mutations-count" file?

Psirving commented 3 years ago

All of the column names in this file are listed below, and you can find them in docs/file-formats.md. I tried several things, and I actually never compared this to the original ShapeMapper profiles, so I'm not sure if you'll be able to reproduce that exactly. I think I always included non-ambiguous single-nuc indels and mismatches, and tried including either/both of non-ambiguous multinuc and complex mutations. By "new effective read depth" I mean the effective read depth subtracted by the total of all excluded mutations. I'm not sure if the read depth in the mutations-count file is the effective read depth or the total read depth. The effective read depth is reported in the profiles.txt file.

Another strategy I tried was comparing the profiles of left-aligned and right-aligned ambiguous mutations. Left is default, and you can get right by using --right-align-ambig when calling shapemapper. If a site appeared in both, I called it unambiguously, and if it only appeared in one or the other, it was an ambiguous site.

Mutation classifications: "A-","T-","G-","C-" (single-nucleotide deletions) "-A","-T","-G","-C","-N" (single-nucleotide insertions) "AT", "AG", "AC", "TA", "TG", "TC", "GA", "GT", "GC", "CA", "CT", "CG" (single-nucleotide mismatches) "multinuc_deletion" "multinuc_insertion" "multinuc_mismatch" "complex_deletion" "complex_insertion" "N_match" (not a real mutation, just an ambiguous basecall) All classifications may also have _ambig appended if a mutation involves any ambiguously aligned nucleotides.

samcho commented 3 years ago

Hello,

I have another question.. I was wondering if ShapeMapper1, similar to ShapeMapper2, also collapses nearby mutations. If that is the case, what is the threshold that I should should use in ShapeMapper1? Should I use 6 as is done for ShapeMapper2?

Psirving commented 3 years ago

I'm not sure if ShapeMapper1 does this. ShapeMapper2 is optimized for SHAPE and DMS data for structure prediction. You might need to empirically determine the best parameters for your experiment.

I think that the 6 nucleotide threshold was physically justified by the footprint of Superscript II. Once the enzyme is >6 nt away from an adduct, it is unlikely that it's activity will be affected by that adduct. I'm not sure if that's accurate, but if true, than it is probably always good idea to collapse mutations within 6 nucleotides of each other. Ultimately, it is justified here because it produces the best SHAPE profiles for structure prediction.

Weeks-UNC / shapemapper2

Nextera prep + random primer #22