Weeks-UNC / shapemapper2

Public repository for ShapeMapper 2 releases
Other
29 stars 16 forks source link

Should the target sequence always be transcriptome sequence? #19

Closed uaauaguga closed 3 years ago

uaauaguga commented 3 years ago

Hello Steven:    I have a little concern when using shapemapper2.   In STAR alignment in /internals/python/pyshapemap/components.py, the parameter "--scoreGap -1000000" was specified, with comment "disable splice junction detection". Bowtie2 does not support spliced alignment. Does this mean the target sequence should always be cDNA, and a genome sequence should never be used?   Further more, its clearly documented when counting mutations, "ShapeMapper infers the locations of chemical adducts as the reference sequence position immediately 5′ of the last unchanged reference nucleotide before a mutation scanning 3′→5′ ." It seems if genomic sequences were used, this will lead to different behaviors for transcripts in different strand.   If it is the case, is there any suggestions for preparing target sequences for genome wise study of eukaryote (one gene may have multiple isoforms)?   Thanks for providing such a useful pipeline. Regards.

shapemapper commented 3 years ago

Yes, shapemapper is designed to use cDNA target sequences written in the same 5-prime to 3-prime left-to-right sense, and is not designed to handle spliced genomic mapping.

For a genome-wide eukaryotic study, you're going to run into multiple problems, among them: 1) Unless you're sequencing to ridiculously high coverage, and/or using a reagent with a very high modification rate, and/or somehow normalizing your library, only a handful of the most highly-abundant transcripts in your sample will be sequenced to a depth that gives reliable SHAPE-MaP signal. 2) shapemapper will crash with large numbers of target sequences, see https://github.com/Weeks-UNC/shapemapper-txome for a possible workaround. If you go that route, if it were me I would limit the target sequences to just the most abundant isoform for each gene. 3) As you've mentioned, mapping against alternate transcript isoforms is not handled by the shapemapper pipeline as written. It is possible to run an alignment yourself, by calling STAR directly, and then "manually" run the components of shapemapper that count mutations using a sequence alignment as input. See https://github.com/Weeks-UNC/shapemapper2/blob/master/docs/modular_workflow.md

In any case, splicing is a tricky problem - you might reach out to current members of the Weeks or Laederach labs and see if anyone else has ideas.

uaauaguga commented 3 years ago

I see. Thanks for reply, it solved my problem.