elzbth / jitterbug

Jitterbug is bioinformatics software that predicts insertion sites of transposable elements in a sample sequenced with short paired-end reads, relative to an assembled reference.

typo on bin size? #7

Closed burnsro closed 7 years ago

burnsro commented 7 years ago

taken from here:

example usage: run with bam file, write to specified directory with specified prefix. Parallelize: use 8 threads, separating by 50 Kbp bins

jitterbug.py --numCPUs 8 --bin_size 50000000 --output_prefix /path/to/my/dir/prefix sample.bam te_annot.gff3

50 Kb = 50,000 bp
50 Mb = 50,000,000 bp
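The unit arithmetic above can be checked with a small helper. This is only a sketch: `to_bp` and the `UNITS` table are hypothetical names for illustration and are not part of Jitterbug.

```python
# Hypothetical helper, not part of Jitterbug: convert a size with a unit to base pairs.
UNITS = {"bp": 1, "kb": 1_000, "mb": 1_000_000}


def to_bp(value, unit):
    """Return the size in base pairs for a value given in bp, Kb or Mb."""
    return int(value * UNITS[unit.lower()])


print(to_bp(50, "Kb"))  # 50000
print(to_bp(50, "Mb"))  # 50000000, the value passed to --bin_size in the example
```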

Also, this is quite a large bin size. Is that standard?

mbosio85 commented 7 years ago

Hi,

You're right, it's 50 Mb :) Thanks for noticing. The bin size only splits the file for processing; it does not merge data across 50 Mb windows. The data are simply processed in 50 Mb windows so that the work can be parallelized without being dominated by data transfer, which is what would happen with much smaller windows.
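To make the trade-off concrete, here is a minimal sketch of splitting a genome into fixed-size bins. It is not Jitterbug's implementation: `make_bins` is a hypothetical helper and the chromosome lengths are made up for illustration. Each bin would be one unit of work for a worker, so far fewer tasks are created with 50 Mb bins than with 50 Kb bins.

```python
def make_bins(chrom_lengths, bin_size):
    """Yield (chrom, start, end) half-open intervals covering each chromosome."""
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, bin_size):
            # The last bin of a chromosome is truncated at the chromosome end.
            yield chrom, start, min(start + bin_size, length)


# Hypothetical chromosome lengths, chosen only for illustration.
chrom_lengths = {"chrA": 120_000_000, "chrB": 75_000_000}

large_bins = list(make_bins(chrom_lengths, 50_000_000))  # 50 Mb bins
small_bins = list(make_bins(chrom_lengths, 50_000))      # 50 Kb bins

print(len(large_bins))  # 5 tasks
print(len(small_bins))  # 3900 tasks
```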

And nope, it's not a standard value. It was simply chosen after in-house tests as a way to split the work.

Thanks, I'll correct it in the next commit.

Mattia

burnsro commented 7 years ago

OK, from test runs with the window size increasing from 1 Kb to 50 Kb, using the Arabidopsis data cited in the Jitterbug paper (Ler1 mapped to the Col-0 reference; paired-end reads, 180 bp fragment size, 80 bp reads), I find a correlation between the bin size and the number of TEs found, as counted in the gff3 file it produces.

For example:

1Kb: 7 TE insertions
2Kb: 64 TE insertions
5Kb: 250 TE insertions
10Kb: 470 TE insertions
15Kb: 672 TE insertions
25Kb: 839 TE insertions
50Kb: 1000 TE insertions

So I'm wondering what parameters go into choosing a suitable bin size, and whether it has something to do with Jitterbug being better at finding breakpoints in bigger windows?
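One way to look at the counts above is to compute how many extra insertions each additional Kb of bin size yields between consecutive runs. The sketch below uses only the numbers reported in this comment; `gain_per_kb` is a hypothetical helper name.

```python
# Bin sizes (Kb) and TE insertion counts as reported in the comment above.
bin_kb = [1, 2, 5, 10, 15, 25, 50]
insertions = [7, 64, 250, 470, 672, 839, 1000]


def gain_per_kb(sizes, counts):
    """Extra insertions per additional Kb between consecutive bin sizes."""
    pairs = list(zip(sizes, counts))
    return [(c1 - c0) / (s1 - s0) for (s0, c0), (s1, c1) in zip(pairs, pairs[1:])]


gains = gain_per_kb(bin_kb, insertions)
for (small, large), gain in zip(zip(bin_kb, bin_kb[1:]), gains):
    print(f"{small:>2} -> {large:>2} Kb: {gain:.1f} extra insertions per Kb")
```

In these numbers the gain per extra Kb falls from about 57 at the smallest bins to about 6 between 25 Kb and 50 Kb, so the count is still rising but more slowly. Seven data points from one data set are not enough to say where, or whether, it levels off.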

mbosio85 commented 7 years ago

Hi,

Sorry for the delay. Do you see any saturation effect? We introduced the large windows to parallelize the work. Jitterbug generates clusters by looking within a window, and smaller windows give fewer opportunities to find clusters falling within the same interval.
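The effect described here can be illustrated with a toy model. This is a sketch under stated assumptions and not Jitterbug's clustering algorithm: reads are reduced to single positions, a cluster is a run of at least `min_reads` positions with gaps of at most `max_gap`, and clustering is done independently inside each bin. The function name, both thresholds and the positions are all hypothetical.

```python
from collections import defaultdict


def count_clusters(positions, bin_size, max_gap=500, min_reads=3):
    """Count clusters when positions are clustered independently per bin."""
    bins = defaultdict(list)
    for pos in sorted(positions):
        bins[pos // bin_size].append(pos)

    total = 0
    for members in bins.values():
        run = 1  # length of the current run of nearby positions
        for prev, cur in zip(members, members[1:]):
            if cur - prev <= max_gap:
                run += 1
            else:
                if run >= min_reads:
                    total += 1
                run = 1
        if run >= min_reads:
            total += 1
    return total


# Made-up read positions forming three groups of three nearby reads.
positions = [900, 1000, 1100, 4950, 5050, 5150, 9800, 9900, 10100]

for bin_size in (1_000, 5_000, 50_000):
    print(bin_size, count_clusters(positions, bin_size))
```

With these made-up positions, 1 Kb bins find no cluster, 5 Kb bins find one, and a 50 Kb bin finds all three, because small bins cut groups of supporting reads at the bin boundaries.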

We prefer to include "as much as possible" in the windowing phase and then filter by quality metrics etc. to improve precision.

Hope it helps

Mattia

burnsro commented 7 years ago

What would these quality metrics be? Coverage?

mbosio85 commented 7 years ago

Hi,

The quality metrics are those specified in the filter, such as the number of reads supporting the TEI, the TEI size, and the consistency of the supporting reads (whether they all come from the same type).

Mattia
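A filter over the three kinds of metric named above could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Candidate` fields are not Jitterbug's output schema, "TEI size" is interpreted as the span of the predicted interval, and the thresholds are arbitrary rather than Jitterbug's defaults.

```python
from typing import NamedTuple, Tuple


class Candidate(NamedTuple):
    # Hypothetical fields, not Jitterbug's actual output format.
    chrom: str
    start: int
    end: int
    supporting_reads: int
    te_types: Tuple[str, ...]  # type suggested by each supporting read


def passes_filter(c, min_reads=3, max_size=1000, min_consistency=0.8):
    """Keep a candidate only if it has enough consistent support and a small span."""
    if c.supporting_reads < min_reads:
        return False
    if c.end - c.start > max_size:
        return False
    if not c.te_types:
        return False
    # Fraction of supporting reads that agree on the most common type.
    most_common = max(c.te_types.count(t) for t in set(c.te_types))
    return most_common / len(c.te_types) >= min_consistency


candidates = [
    Candidate("chr1", 100, 300, 5, ("Copia",) * 5),
    Candidate("chr1", 100, 300, 2, ("Copia", "Copia")),
    Candidate("chr1", 100, 300, 4, ("Copia", "Gypsy", "Copia", "LINE")),
    Candidate("chr1", 100, 5000, 5, ("Copia",) * 5),
]
kept = [c for c in candidates if passes_filter(c)]
print(len(kept))  # 1
```

Only the first candidate survives: the second has too few reads, the third has reads that disagree on the type, and the fourth spans too large an interval.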