Weeks-UNC / shapemapper2

Public repository for ShapeMapper 2 releases

Error: must provide at least one target sequence #14

Open s-t-calus opened 4 years ago

s-t-calus commented 4 years ago

Hello,

Executed code:

shapemapper --name example --target ../FASTA/RNA.fasta --out RNA_shapemap --amplicon --modified --R1 100mM_1M7_RNA_R1.fastq --R2 100mM_1M7_RNA_R2.fastq --untreated --R1 DMSO_1M7_RNA_R1.fastq --R2 DMSO_1M7_RNA_R2.fastq --denatured --R1 Denat_contr_1M7_RNA_R1.fastq --R2 Denat_contr_1M7_RNA_R2.fastq

But I'm getting error:

Running /software/shapemapper-v2.1.5/bin/shapemapper (in conda environment /software/shapemapper-v2.1.5)

Started ShapeMapper v2.1.5 at 2020-03-16 14:01:25
Output will be logged to shapemapper_log.txt
Running from directory: /data/SHAPE_test/Backup
args:
Traceback (most recent call last):
  File "/software/shapemapper-v2.1.5/shapemapper-2.1.5/internals/python/cli.py", line 141, in <module>
    run(sys.argv)
  File "/software/shapemapper-v2.1.5/shapemapper-2.1.5/internals/python/cli.py", line 51, in run
    pipeline, arg_dict = ap.construct(rest_args)
  File "/software/shapemapper-v2.1.5/shapemapper-2.1.5/internals/python/pyshapemap/pipeline_arg_parser.py", line 530, in construct
    kw = parse_args(args)
  File "/software/shapemapper-v2.1.5/shapemapper-2.1.5/internals/python/pyshapemap/pipeline_arg_parser.py", line 274, in parse_args
    raise RuntimeError("Error: must provide at least one target sequence (--target)")
RuntimeError: Error: must provide at least one target sequence (--target)

Moreover, shapemapper --version produced exactly the same error, and --help does not work either. Does this mean there is an issue with the installation?

shapemapper commented 4 years ago

Hm. That's weird. 1) What type of environment/operating system are you running on? 2) From inside the shapemapper-2.1.5 folder, run

source internals/paths/bin_paths.sh

then

python --version
python -c 'import sys; print(sys.path)'

and see what those commands output.

s-t-calus commented 4 years ago

We managed to fix that error; it was caused by a bug during the installation process.

Nonetheless, I have another issue. The data were generated on a MiniSeq 2x150 run for a ~100 bp in vitro transcribed RNA, which means there are some wasted nucleotides in the FASTQ files, as we oversequenced these libraries. The ShapeMapper output included "Note: possible data quality issue - see log file" (attached). The SHAPE reactivity is more or less at the correct positions, but I noticed that the mutation rate is quite low. Do you think this issue is related to oversequencing of the reads, or is there not much difference between the Modified and Untreated RNA molecules due to degraded 1M7 and NAI-N3? I tested both SHAPE reagents at 100 mM, and both the note and the reactivity look similar for the two, which seems correct, but since this was our first run I would appreciate your feedback.

Regards stc

SHAPE_1M7_100mM_profiles.pdf

Psirving commented 4 years ago

I think 1&2 are most likely, because your background mutation rates are also low.

  1. DNA contamination - If this is the case, you might see many reads that have 0 mutations, and some with 3-5. I think the flag "--per-read-histograms" will give you this info. You could also do PCR with no RT step and see if you get similar product. The solution is just more/longer DNase treatment.
  2. RT reaction conditions - Degraded DTT is a common issue. It is unstable and should be prepared fresh. You can also make single-use frozen aliquots that will last a couple months. Avoid freeze-thaw. When DTT is added to MnCl2 solution, a color change should occur.
  3. Low modification due to degraded reagents - I have not used NAI-N3, but I have used NAI. NAI DMSO solution is only stable for a few months at -20C. 1M7 will degrade very quickly in the presence of moisture. 1M7 DMSO solution should be prepared immediately before the experiment using anhydrous DMSO. If your 1M7 is bad, 5NIA is a cheaper and more reactive alternative: https://www.ncbi.nlm.nih.gov/pubmed/31117385.
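
To make the check in point 1 concrete, here is a minimal Python sketch of how a list of per-read mutation counts could be summarized. It assumes you have already extracted one integer per read (the function name and input format are hypothetical, and this does not parse any actual ShapeMapper output file). A large zero-mutation fraction together with a small population of reads carrying 3 or more mutations would be consistent with the DNA contamination pattern described above.

```python
from collections import Counter


def summarize_mutation_counts(counts):
    """Summarize per-read mutation counts (hypothetical input: one int per read)."""
    total = len(counts)
    if total == 0:
        raise ValueError("no reads supplied")
    hist = Counter(counts)
    # Fraction of reads with no mutations at all.
    zero_fraction = hist.get(0, 0) / total
    # Fraction of reads with 3 or more mutations.
    three_plus_fraction = sum(n for k, n in hist.items() if k >= 3) / total
    return {
        "reads": total,
        "zero_fraction": zero_fraction,
        "three_plus_fraction": three_plus_fraction,
        "histogram": dict(sorted(hist.items())),
    }


# Toy data only: 7 unmutated reads and 3 reads with several mutations.
example = [0, 0, 0, 0, 0, 0, 0, 4, 3, 5]
summary = summarize_mutation_counts(example)
print(summary)
```

The thresholds here (0 and 3 or more) simply mirror the numbers mentioned in the comment; they are not calibrated cutoffs.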

Psirving commented 3 years ago

Weird, I got an email notification of a new message from STC on this issue, but I don't see the message displayed here. The message was about difficulties amplifying a very high GC content IVT RNA with short tandem repeats. I personally don't have experience with this type of transcript, but Steve does. He wrote a paper on SHAPE with the Huntingtin transcript (DOI:10.1021/bi401129r). Others in our lab have worked with this transcript, and I believe the consensus was that it is very difficult due to the CAG and CCN STRs. I wonder if the Marathon RT enzyme would work for this type of transcript (https://doi.org/10.1016/j.jmb.2020.03.022).

shapemapper commented 3 years ago

Yeah, there are a bunch of thermostable/extra-processive RT enzymes that remain underexplored for SHAPE-MaP type applications. The caveat is that there are probably tradeoffs between adduct detection rate and noise on one hand and enzyme processivity on the other, so if you identify an enzyme that can RT a highly GC-rich RNA, there's no guarantee you'll get a usable adduct detection signal.

s-t-calus commented 3 years ago

@Psirving In my case, we do not have a huge problem with the identification of short tandem repeats >40, but rather with a highly GC-rich region (>95% or even 100% GC) in part of the 3' UTR, i.e. just before the RT starts. I've performed 3 consecutive experiments using ssDNA adapter ligation (Lucks, SHAPE-Seq 2.0), random hexamers, and also a gene-specific primer that was meant to bind in a GC-balanced section of the 3' end. Unfortunately, all experiments and samples, including the DMSO control, showed a huge amplification bias, which I believe first arose at the cDNA synthesis level, since for the PCR we use universal Y-shaped Illumina adapters added by T/A ligation. Most of the amplicons are shorter and tend to start just after this extremely GC-rich region; alignment of the reads to the reference sequence confirmed that very well. We modified the nucleotide ratio for the RT-PCR steps to increase the abundance of G and C, and increased the temperature to allow these hydrogen bonds to break completely, but it did not help. Two of the constructs I work on are much longer at the 3' end because they were in vitro transcribed differently; these additional 50-70 AT-rich nucleotides helped to generate full-length constructs, but only in very small quantities. However, they may be strongly under-clustered on the Illumina flow cell, and a significant majority of the data will represent the shorter/truncated amplicons that make up >95% of the PCR material.

@shapemapper Could you give me your feedback regarding SHAPE on the Huntingtin transcript (https://pubs.acs.org/doi/10.1021/bi401129r)? I looked at the sample prep section, but since this is a paper from 2013 I would rather trust the protocol by Smola 2015. I also tried to find a link to the FASTQ data to compare your read coverage but could not spot it. Did you observe such strong truncation of cDNA synthesis in complex RNA constructs? Moreover, would you recommend using MarathonRT from the A. M. Pyle 2020 paper to overcome such problems, or just using random 9-mer primers for the RT step, or ligating a long AT-rich adapter construct to the 3' end of the SHAPE-treated RNA that would allow the SSII enzyme to start priming earlier than at this >95/100% GC part?

Thank you very much for all your suggestions, I really appreciate your comments. s-t-c

PS. My previous comment had a few typos that I wanted to correct, but something went wrong and it disappeared from here.

Psirving commented 3 years ago

I'm sorry you're having these troubles. We have experience with difficult transcripts like this, and well, they are difficult. I still think that the issue is likely a strong secondary structure interfering with RT. Some structures will just not melt, even at high RT temperatures. Another less likely possibility is the presence of natural RNA modifications that interfere with RT.

If you have full-length amplicons in your mixture, you might be able to enrich them by a size-selection method on your libraries at either/both PCR steps. Of the library prep methods, I would expect GSPs for both RT and PCR to be the most robust in ensuring full-length products. Primers are cheap, so we often test several sets of primers to ensure the cleanest product with low off-target and primer-dimer amplification. 9-mers might work better than 6-mers, but both have a strong sequence bias. I don't have experience with the Y-shaped adapters.

I have used Marathon RT with SHAPE and DMS. In short, I would recommend it for DMS, but advise caution with SHAPE. It is processive. With both reagents, the background mutation rate is VERY low, and detection rate is similar to SSII using the conditions from the paper you mentioned. However, with SHAPE reagent, you lose almost all of the modification signal at "A". This is definitely a big downside and the effect on folding predictions is untested. I'm unsure if the Pyle lab had similar results in their work, but they don't advertise this in that paper. Another limitation of Marathon is that it does not extend every available template-primer pair. Despite better processivity, total cDNA yields may be lower than with SSII.

s-t-calus commented 3 years ago

@Psirving No need to be sorry. It seems we designed these RNA constructs badly and will have to take that into consideration in the future, i.e. always include polyadenylation during the IVT and avoid extreme GC% in the 3' UTR region. Generally speaking, SHAPE-Seq is not an easy protocol (custom buffers etc.) and we expected certain limitations with this method. At least I have confirmation from both of you that such issues also occur in your hands and are difficult to resolve. We may try to reproduce Pyle's paper; however, getting this custom RT polymerase may take some time. In the meantime we will try a higher RT temp. and re-PCR the longer constructs after the bead clean-up. We had an idea to introduce 7-deaza-dGTP at the IVT step to avoid strong secondary GC structures, but we are not sure how that would affect the error profile and reactivity. Do you know whether this has been tested in the past?

@shapemapper Do you think applying nanopore technology with acetylimidazole could overcome the issues related to the extremely high-GC region, or is this method still inaccurate and likely to produce lots of false signals? https://www.biorxiv.org/content/10.1101/2020.05.31.126763v1

One of my colleagues just looked through the Htt constructs in your aforementioned paper, and it seems you also had a relatively high amount of GC at the 3' end. I believe we are having similar problems, especially given that "Others in our lab have worked with this transcript and I believe the consensus was that it is very difficult due to the CAG and CCN STRs". Now it all makes sense to me! Does it mean you used de novo + in silico modeling of this Htt construct, or just repeated the experiment multiple times? We generated some data for our RNA constructs and I've been wondering how to rescue this dataset.

Once again thank you very much for all your suggestions. s-t-c

shapemapper commented 3 years ago

Hi s-t-c,

w/r/t the HTT transcript, my original publication on that transcript was done with fluorescent RT primer electrophoresis, not the MaP method. IIRC followup SHAPE-MaP experiments on that transcript did run into issues that suggested reduced reverse transcription through the GC-rich region, and we never got a great signal for that region (although it's probably strongly base paired so we wouldn't really expect much SHAPE signal in any case).

With a GC rich primer, you're likely to get off-target amplicon products, and you're also likely to select for shorter amplicons that don't span a "difficult" region of the transcript. If your products are 100% off-target, you're out of luck, but if you have any of the correct amplicon in your library, I would recommend trying the shapemapper --amplicon option plus a --primers file if you're not already: see https://github.com/Weeks-UNC/shapemapper2/blob/master/docs/analysis_steps.md#primer-trimming-and-enforcement-of-read-location-requirements. That will let you filter out the unintended products and focus on reads that start and stop at the designed primer sites.
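
To illustrate the idea of keeping only reads that start and stop at the designed primer sites, here is a rough Python sketch. It is not ShapeMapper's implementation: the function names, the `(start, end)` tuple representation of an aligned merged read pair, and the `tolerance` parameter are all assumptions made for this example.

```python
def spans_amplicon(read_start, read_end, amplicon_start, amplicon_end, tolerance=2):
    """Return True if an aligned read begins and ends near the designed primer sites.

    Coordinates are 1-based positions on the target sequence (an assumption of
    this sketch); `tolerance` is a hypothetical allowance in nucleotides.
    """
    return (
        abs(read_start - amplicon_start) <= tolerance
        and abs(read_end - amplicon_end) <= tolerance
    )


def filter_reads(alignments, amplicon_start, amplicon_end, tolerance=2):
    """Keep only (start, end) alignments that match the intended amplicon."""
    return [
        (start, end)
        for start, end in alignments
        if spans_amplicon(start, end, amplicon_start, amplicon_end, tolerance)
    ]


# Toy alignments: two full-length products, one truncated at the 5' end of the
# alignment, and one truncated at the 3' end.
alignments = [(1, 100), (2, 99), (45, 100), (1, 60)]
kept = filter_reads(alignments, amplicon_start=1, amplicon_end=100)
print(kept)
```

In this toy case the truncated products at (45, 100) and (1, 60) are dropped, which is the kind of filtering that should remove shorter amplicons that do not span the difficult region.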

As far as nanopore adduct detection goes, that's an interesting thought - I actually do think that an acetylimidazole-modified transcript like HTT might go through a nanopore fairly well. Will Stephenson was able to get ribosomes through, and those obviously have regions of high GC content. That said, I think nanopore depends to some degree on "threading" the molecule through the pore, so if the end of the RNA is folded over or inaccessible to a DNA splint, you might have trouble getting the molecules started through the hole. If it's in the context of a larger sequence I bet they would go through fine.

Psirving commented 3 years ago

We've been getting Marathon RT from Kerafast.com. They usually ship it quickly.

I'm not aware of anybody having done SHAPE-MaP on IVT's containing 7-deaza-dG's.

s-t-calus commented 3 years ago

Thank you for all the suggestions. In the case of MarathonRT we may use quenched NAI, as 1M7 has a very low mutational profile according to Figure 3b from that paper. Do you have experience with icSHAPE for RNA structure detection? Do you think that method would solve the problem with the highly GC-rich region, or would NAI-N3 encounter the same problems?

s-t-calus commented 3 years ago

Hello @shapemapper @Psirving, sorry to bother you again. I just have a simple question regarding your 2019 paper, i.e. "Guidelines for SHAPE Reagent...". Is there a reason why you did not include a Denaturing Control (DC) with your cellular RNA? It seems all the plots are generated only for the SHAPE and DMSO control samples. Do you still use a DC for SHAPE-MaP analysis (both cellular and IVT RNA), or is this control not that valuable, so that it could be dropped in some newer version of your ShapeMapper algorithm? Even in the case of cellular mRNA, it would be possible to heat the RNA and treat it in vitro to generate a denaturing control. I'm just interested in whether it is a crucial control or whether you can generate reproducible results without it. Did you observe a significant drop in the reproducibility of results when the DC is absent?

Psirving commented 3 years ago

The quote below is from Steve's 2017 paper on ShapeMapper2 (https://doi.org/10.1261/rna.061945.117). We only use a denatured control when we need the highest accuracy and when we have high confidence in our ability to get a good-quality sample. Mostly we don't consider this sample necessary.

""" Use of a denatured control

Obtaining a denatured control (Siegfried et al. 2014; Smola et al. 2015b) for a MaP experiment can be challenging (and in some cases infeasible), uses valuable sequencing bandwidth, and can even hurt calculated reactivity profile accuracy if RNAs are degraded or overamplified. For these reasons, ShapeMapper 2 does not require the use of a denatured control. Most background mutations are accounted for using mutation rates from a no-reagent control, but when the very highest accuracy is desired, a denatured control can provide an approximate mutation detection rate correction that improves recovery of base-pairing information (Supplemental Fig. S6B). """

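As a rough illustration of what the quoted passage means in practice, the per-nucleotide calculation can be sketched in Python as below. This follows my understanding of the published MaP descriptions (background subtraction using the no-reagent control, with optional division by the denatured-control rate); the function name and the handling of a zero denatured rate are assumptions of this sketch, and the later normalization step is not shown.

```python
import math


def raw_reactivity(modified, untreated, denatured=None):
    """Background-subtracted mutation rate for one nucleotide.

    `modified`, `untreated`, and `denatured` are per-nucleotide mutation rates.
    When no denatured control is available, only the no-reagent background is
    subtracted. Returning NaN for a non-positive denatured rate is a choice
    made for this sketch, not documented ShapeMapper behavior.
    """
    difference = modified - untreated
    if denatured is None:
        return difference
    if denatured <= 0:
        return float("nan")
    return difference / denatured


# Toy rates: 1.2% in the modified sample, 0.2% background, 2% denatured.
without_dc = raw_reactivity(0.012, 0.002)
with_dc = raw_reactivity(0.012, 0.002, denatured=0.02)
print(without_dc, with_dc)
print(math.isnan(raw_reactivity(0.012, 0.002, denatured=0.0)))
```

Because the denatured rate sits in the denominator, a degraded or overamplified denatured sample feeds its noise directly into every reactivity, which is consistent with the caution in the quote.
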
s-t-calus commented 3 years ago

Hello @Psirving @shapemapper, the enzyme you suggested, i.e. MarathonRT, made a huge difference: the high-GC% region I was initially having problems with got reverse transcribed quite well, with improved amplicon length distribution, sequencing coverage, and reliable SHAPE reactivity. Once again, thank you very much for your help. However, I have another question.

This time I have a construct that is AU-rich and potentially forms multiple kissing loops or pseudoknots, and the results of the RT-PCR with MarathonRT or SSII seem to be even worse than with the previously described high-GC region. Is that something you've observed in the past when the structure was unstable or very complex? Is there a chance that lowering the MnCl2 conc. would help the enzyme linearize such RNA, or would an additional RT-PCR additive be needed?

Moreover, did you have a chance to test novel SHAPE chemicals such as 2A3, B5 or NIC (https://pubmed.ncbi.nlm.nih.gov/33398343/)? It seems these chemicals work better in vivo than NAI; however, some of them are highly unstable, so I'm not sure it is even worth trying them, as they are not available commercially.

Psirving commented 3 years ago

I'm curious if you see the same issue with signal detection at "A" nucleotides with MRT that I mentioned above. If you don't have a MaP signal for "A", I would probably recommend ignoring those positions for profile normalization and setting them to no-data for structure prediction.
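
A small Python sketch of that masking step is below. It assumes the reactivity profile is held as a plain list aligned to the sequence, and it uses -999 as the no-data value, which is the convention I know from SHAPE reactivity files used for structure prediction; the function name and data layout are hypothetical.

```python
NO_DATA = -999.0  # assumed no-data marker for SHAPE reactivity files


def mask_nucleotide(sequence, reactivities, nucleotide="A"):
    """Replace reactivities at every `nucleotide` position with NO_DATA."""
    if len(sequence) != len(reactivities):
        raise ValueError("sequence and reactivities must be the same length")
    return [
        NO_DATA if base.upper() == nucleotide else value
        for base, value in zip(sequence, reactivities)
    ]


sequence = "GAUCA"
profile = [0.1, 0.0, 0.8, 0.3, 0.05]
masked = mask_nucleotide(sequence, profile)

# Only the remaining positions would be passed on to profile normalization.
usable = [value for value in masked if value != NO_DATA]
print(masked)
print(usable)
```

Masking before normalization keeps the missing "A" signal from dragging down the scaling factor, and the no-data value tells the folding software to apply no restraint at those positions.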

I don't remember having any specific issues with AU-rich regions before. We've had some luck generally with difficult targets using betaine in the PCR reactions. Lowering the MnCl2 concentration tends to have a deleterious effect on SSII. I think in the MRT paper they report some results at lower MnCl2 concentrations, but I don't remember what they were. I also seem to remember that MRT is more thermostable, so a higher RT temp might be an option.

If you're committed to this difficult construct, I recommend QC of the construct, the RT product, and the PCR products to try to determine the limiting step.

I have tried 2A3 and seen some promising results, but yes it is a hassle that you need to synthesize it and that it is unstable. It hydrolyzes and we think CO2 is also responsible for some of the decay, so if you are going to try it, I recommend keeping the reaction very dry and under inert gas for longer-term storage, or prepping it fresh for each reaction.