Randomly assigned - Githubissues

BSSeeker / BSseeker2

A versatile aligning pipeline for bisulfite sequencing data

http://pellegrini.mcdb.ucla.edu/BS_Seeker2/

MIT License

60 stars 25 forks source link

Randomly assigned #24

Open olmath opened 5 years ago

olmath commented 5 years ago

Hi Weilong,

Thanks for putting BSseeker2 up, this is a just wonderful tool! Comparing the BAM files generated by BSseeker2 starting from paired-end reads, with the ones generated with another program (methylpy), I see that BSseeker2 manage to map reads to transposons present in several highly similar copies in the genome, whereas methylpy does not. Since BSseeker2 and methylpy both use bowtie2 to align, and since (if I understand correctly) both programs only work with uniquely-mapped reads by default, I am wondering why I observe such difference.

Could you help?

guoweilong commented 5 years ago

I did not read the code of methpy. Do you have simulation data to show which software run with higher accuracy for mapping?

Best, Weilong

olmath commented 5 years ago

Hi, No I did not make simulations. I've run both BSseeker2 and methylpy in parallel on the same paired-end data, and I observe mapping differences along transposons that are present in highly similar copies in the genomes. Below is an IGV snapshot centered on such a transposon, the top track is the output of methylpy, bottom track the output of BSseeker2: Capture d’écran 2019-04-03 à 15 29 36

If I blast reads that have been mapped by BSseeker2 and not by methylpy, I see that they map identically to several identical transposon copies.

Therefore, I think that BSseeker2 retain (at least some) reads mapping to multiple hits. Maybe those reads are then randomly assigned to the different identical hits.

Is that correct?

guoweilong commented 5 years ago

Hi @olmath ,

Thanks for your reports. Actually, BS-Seeker2 is strict in choosing best unique hits. You can read our paper (BMC Genomics, 2013) and found that we report at least 2 best hits from two strands alignment, and then calculate the best unique hits finally. From our simulation data, the mapping accuracy is as high as >99%. Thus, I don't think there are randomly assigned reads in BS-Seeker2 pipeline.

Actually, I have done lots of REPEATS regions study in human and mouse (http://www.nature.com/articles/srep32207)[http://www.nature.com/articles/srep32207], where the TE regions also have lots of well-mapped reads.

For the other software you mentioned, I have never use it, and would not make any comment on it. Maybe an advantage of BS-Seeker2 is that its mappability is higher.

Best, Weilong