Weeks-UNC / shapemapper2

Public repository for ShapeMapper 2 releases
Other
29 stars 16 forks source link

Mapped reads from amplicon wokflow - Error #30

Closed angelika888 closed 2 years ago

angelika888 commented 2 years ago

Hello,

I have run ShapeMapper2 to calculate map data for two 1500-nt overlapping amplicons. During library preparation they were fragmented, so not all obtained reads contain the sequence of amplicon PCR primer. Input sample was only one target RNA (finally DNA of course :)).

During calculation I used flag "--amplicon". First, after ShapeMapper calculation I got warning:

WARNING: no random primer length was specified, but at least one RNA is longer than a typical directed-primer amplicon. If analyzing a randomly primed experiment that was not subjected to a Nextera prep, use the --random-primer-len option to exclude mutations within primer binding regions.

Second, I got incorrectly looking mapped depth plots (please see below).

Do you know how can I improve such calculation?

PS. I also saw a nice work (https://doi.org/10.1016/j.molcel.2020.12.041), where they used longer amplicons and the ShapeMapper2 calculation. So, I hope that used approach is not a problem.

Regards! Angelika

Mapped_depth_plot

Psirving commented 2 years ago

This is the intended effect of --amplicon. This option enforces that read ends align to the primer locations. It looks like you also used --primers with a primers file to indicate the locations of primers for multiple amplicons. This option will trim the primer sequence if it is found at the ends of a read. Try running shapemapper again with the --primers option, but not the --amplicon option.

angelika888 commented 2 years ago

Thank you for your answer! I have just run the calculation with only the --primers option. Nevertheless, I got the same error. I also have run the other calculation - I uploaded only half of the target sequence (1. amplicon) flanked by the first pair of amplicon primers. That calculation was without --amplicon and --primers, only lowercase in the target sequence. I got a mapped depth plot as you can see below. Does it look better in your opinion? Can I calculate map data for each primer separately? Will it produce reliable data?

1primer_pair

Psirving commented 2 years ago

Seeing as this is one of the main workflows of the SHAPE-MaP wet lab experiment, I should know the solution to this, but I don't. I will look into this further, but I've come up with 2 possible solutions that might help you for now.

  1. Run the shapemapper software as you did originally, but use --max-primer-offset 1500. This should allow reads which don't map directly to amplicon primers, but if it does, it should remove the primer regions from further analysis steps. If you try this method, please send me the mapped depth figures so I can see if this behaved as expected.
  2. Run the shapemapper software separately on each amplicon without the --amplicon flag, and manually concatenate the ".map" files in Excel and save it as a tab delimited file with ".map" extension. For the first amplicon, Replace the 5' primer region with -999 values, and delete the 3' primer region. For the second amplicon, delete the overlap with the first amplicon, and replace the 3' primer region with -999 values. Make sure the Nucleotide column is continuous from 1 to ~2600, and that the sequence column is correct.
Psirving commented 2 years ago

@shapemapper @anthonymustoe Do either of you know the correct solution here? Is there an easy way to analyze long, tiled amplicons which have been fragmented (Amplicon workflow from Smola et. al. 2015) using standard ShapeMapper parameters?

angelika888 commented 2 years ago

So, I've tested your first solution and I got a different plot (see below), but with the 400-nt gap in the middle (position 900-1300). There was no gap in such a location when I used the second solution with one pair of primers (see my previous answer).

Additionally, when I looked below read depths histograms, there were strangely low 5th percentile depths only for 1. solution (see below).

Besides that, Pearson correlations for these two SHAPE reactivity datasets (only for 25-700 nt, because of the mentioned gap) is 0.99, and both differ from primary data (about 0.75).

What do you think about that? Mapped_depth Read_depths

Psirving commented 2 years ago

Shoot, this is what I was worried about. Because we increased --max-primer-offset 1500 Shapemapper is confused about which amplicon to align to in the area of overlapping amplicons. It seems to favor amplicon 2 for some reason (orange line), then trims all of those reads 5' of the forward primer (orange unshaded region). Unless Tony or Steve reply with a different solution, I think you will have to analyze these amplicons separately, then manually combine them in Excel. If the data for each amplicon are in separate fastq files, i.e. if they were given different sequencing barcodes, uncombine them before the analysis step.

As a side note, you see three regions of higher read depth. The one in the middle makes sense, this is the region where you have the combined read depth of amplicon 1 and 2. But the 5' and 3' regions also have much higher read depth. It could be due to off-target priming in these regions. If thats the case, you might see some weird reactivity patterns and high background reactivity in the area around 50 and 2700. If you see a weird pattern there, you should mask those regions, i.e. change the data in the .map file to -999. If you have an experimental or library prep justification for those higher read depths, then I wouldn't worry about it.

angelika888 commented 2 years ago

Yeah, I noticed these regions of higher depth. I It's because I've divided my target sequence into more amplicons and I have tested calculation for only part. At the 3' end the program also found reads for the next amplicon. At the 5' end there is a repeating of sequence from other location. My question is can I input all amplicons in the one library and use a single common fastq file as I did here?

anthonymustoe commented 2 years ago

I am not sure I entirely follow what is going on, but I have a few comments: -You definitely don't want to be using the amplicon flag, as Patrick said

-I am not sure what the primers option does in the absence of amplicon. Have you tried running ShapeMapper without any of the options -- just proving a totally uppercase fasta sequence for alignment?

-The high read depths you are seeing at the 5' and 3' ends of each amplicon are odd compared to what I am used to. How did you prepare library? For a Nextera prep, you should see minimal reads at the ends. Thus, I am assuming you used a ligation prep. Perhaps the depth profiles you see in the 2nd plot are correct? Otherwise, I think read depth profile reasonable?

-To get a better idea for what your library looks like, you could just do an alignment using STAR and then viewing the reads in IGV. Alternatively, you could use the --output-aligned-reads option, which should give you the sam file which you can view in IGV. This can give you an idea of whether the alignment seems to be correct or if there are artifacts.

-In principle, you should be able to input all amplicons in one library and align using a single common fasta reference. The only issue you'll have is at the primer sites. You'd want to mask those out when they fall at 5' or 3' end of read. I am not sure ShapeMapper does this. Alternatively, you can just mask out all data from these regions.

-The "WARNING: no random primer length was specified" is a standard warning and is not an error. It is just advising you that you should consider using the random-primer flag. It doesn't hurt to use; it just will reduce your effective read depth to a small degree.

angelika888 commented 2 years ago

Thank you for your advice! I've tested the calculation for 3 amplicons together without flags and totally uppercase sequence. The plot I obtained you can see below - there is also higher depth for regions where amplicons overlap. Yes, the library was prepared by ligation. Is this way the correct one to analyze such data by ShapeMapper? Plot

shapemapper commented 2 years ago

I don't really have the bandwidth to dig into this, but I'll comment that the depth profile you've just shown is in some ways the best so far - the vast majority of reads are mapping. However, the downside is you clearly have primer and primer-dimer reads mapping/contributing as well (see the jump at each primer site). This will artificially reduce your observed mutation rate over those primer regions, leading you to underestimate the chemical modification rate in those regions. Outside of those regions you can feel pretty confident.

On Thu, May 19, 2022, 5:09 PM Angelika @.***> wrote:

Thank you for your advice! I've tested the calculation for 3 amplicons together without flags and totally uppercase sequence. The plot I obtained you can see below - there is also higher depth for regions where amplicons overlap. Yes, the library was prepared by ligation. Is this way the correct one to analyze such data by ShapeMapper? [image: Plot] https://user-images.githubusercontent.com/64003504/169404653-603c8dce-a6f2-4118-8e85-5ca2c85ac232.png

— Reply to this email directly, view it on GitHub https://github.com/Weeks-UNC/shapemapper2/issues/30#issuecomment-1132210532, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTVDNRJXBODVRTBVMLXGGTVK2UZRANCNFSM5WE3TUVA . You are receiving this because you were mentioned.Message ID: @.***>

anthonymustoe commented 2 years ago

I would slightly disagree with Steve in that the high depth at the primer overlaps could be an odd feature of a ligation-based library prep. Perhaps the ends of the dsDNA amplicon have preferential ligation efficiency and so are overrepresented in the library. This wouldn't necessarily compromise data. With that said, you'll definitely want to go back and exclude your primer regions from the data. Those are definitely not trustworthy. Also, I would approach the data from the entire high-depth regions with skepticism.

Psirving commented 2 years ago

Thank you Steve and Tony, I was hoping there was a simple way to remove primer sequences that occur at the end of reads while keeping reads that occur in the center of the amplicon. This would be a useful feature to add at some point. I will look into this if I have time.

@angelika888 In short, yes this is the most correct way to analyze these data. But you will need to ignore the data at primer binding regions by changing the values to the mask values (-999 for .shape or .map, nan for profile.txt).

If you want to get data on the primer binding regions, you would have to re-prep your libraries so that each amplicon has different NGS barcodes and analyze them individually using the --amplicon flag. I have a script I can share that will combine the profiles after running shapemapper.

shapemapper commented 2 years ago

Looks like I set up the high-level wrapper to require read mapping to primer sites when --primers is provided. But I think the underlying mutation counter supports primer trimming without requiring end mapping (would need tested obviously). You might just be able to add another high-level option and tweak the logic around line 366 in pipeline_arg_parser.py

if len(p.primers) > 0 or p.primers_in_sequence: p.trim_primers = True # p.require_forward_primer_mapped = True p.require_reverse_primer_mapped = True

Maybe just expose the require_reverse_primer_mapped and require_forward_primer_mapped params at the top level and have those take precedence if the user provides.

On Fri, May 20, 2022, 10:31 AM Patrick Irving @.***> wrote:

Thank you Steve and Tony, I was hoping there was a simple way to remove primer sequences that occur at the end of reads while keeping reads that occur in the center of the amplicon. This would be a useful feature to add at some point. I will look into this if I have time.

@angelika888 https://github.com/angelika888 In short, yes this is the most correct way to analyze these data. But you will need to ignore the data at primer binding regions by changing the values to the mask values (-999 for .shape or .map, nan for profile.txt).

If you want to get data on the primer binding regions, you would have to re-prep your libraries so that each amplicon has different NGS barcodes and analyze them individually using the --amplicon flag. I have a script I can share that will combine the profiles after running shapemapper.

— Reply to this email directly, view it on GitHub https://github.com/Weeks-UNC/shapemapper2/issues/30#issuecomment-1132967165, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADTVDNVXJSPG6RGK5WXOBGDVK6O5FANCNFSM5WE3TUVA . You are receiving this because you were mentioned.Message ID: @.***>

angelika888 commented 2 years ago

I was thinking about it and in the case I have all amplicons reads in one fastq file and if for now, there is no option to remove primer sequences during the analysis of all amplicons together, I have two solutions among I tested above (please, correct me if I'm wrong): 1) I can calculate all amplicons together without flags and all uppercase, BUT in final files, I have to mask all primer sequences by 'no data' 2) I can calculate each amplicon separately using lowercase to point out primers. Then I can combine them as Partick said (for the first amplicon, replace the 5' primer region with -999 values, and delete the 3' primer region. For each next amplicon, delete the overlap with the previous amplicon and delete the 3' primer region. For the last amplicon, delete the overlap with the previous amplicon, and replace the 3' primer region with -999 values. In this solution, overlapping amplicons allow for avoiding no data at primer binding sites in the middle of the sequence.

I will also analyze this alignment with IGV. If this higher read depth is only due to amplicon overlapping, should I use these data with skepticism?

I can also re-prep the test library to check if separate fastq files for each amplicon return me the same data in these regions as for a single fastq file.

Psirving commented 2 years ago

Option 2 won't work. The reason is that the primer cDNA sequence is still being counted for mutations, but they don't contain any. You tried this on one of your amplicons already, I've circled the problem areas, which contain primer sequences from the adjacent amplicons. Option 1 is the way to go. It is not a big deal to have a few regions of no data, but if you really want to capture that data, you will have to re-prep the libraries.

I will look into Steve's suggestion in changing the ShapeMapper code to correctly handle these data. I will let you know if I can fix it quickly or if it will take some time.

image

Psirving commented 2 years ago

Can you share your raw sequencing reads with me, and I will test Steve's solution?

angelika888 commented 2 years ago

Ok! I'll send you my files. I hope it'll help. Is your email address 'psirving@email.unc.edu' correct?

I've also performed some comparisons of Shapemapper calculations. Please see below - there is a difference in plots when I use different alignment software. What do you think about that? Above, for 3 amplicons I also used STAR and maybe that's the reason why 5' and 3' ends didn't have such high depth regions. Bowtie_STAR

angelika888 commented 2 years ago

When I used also lowercase in primer regions together with STAR aligner, it returns the same plot and on SHAPE reactivity plot primers sequences are marked as 'no data'. But when I used --primers option, the generated plot was as the first one at the beginning of this issue.

Psirving commented 2 years ago

Yes, that's my email address. Depending on the size of your files, it might not work.

Here it looks like STAR aligner is less likely to include reads that extend beyond the given reference sequence. You can see that it ignores a lot of the reads that come from the 5' adjacent amplicon because the overlap is much shorter, but we still have the same problem at the 3' end. It also seems to produce slightly more low mapping quality alignments.

This is interesting, but it doesn't solve the issue.

Psirving commented 2 years ago

@angelika888 Sorry it took me awhile to get back to you. Unfortunately, I couldn't find a simple change in the source code that would give us the desired result. As mentioned in your email, ShapeMapper also has trouble with circular RNAs. I think the best solution for you going forward is to prepare amplicons with different barcodes, run ShapeMapper on each amplicon, then combine the profiles manually.

Keep in mind that amplicons that don't overlap the back-splice junction could also amplify from linear RNA.