Closed: YingYa closed this issue 5 years ago.
Hi @YingYa, This looks like it should be easy to support during preprocessing. Are there any unique properties of this data type compared to 10x? e.g. what is the barcode length? read length? avg. reads per barcode? If these are at least comparable to 10x then we can likely just add a preprocessing flag for this data type and run EMA as is.
Hi @arshajii, there are some properties of the stLFR data:
Hi @arshajii,
What does each parameter mean? I'd like to create a new `PlatformProfile` in `src/techs.c` for stLFR.
Thanks
Here's a brief description of each; let me know if you need any more info about any parameter.
`name`: Unique name string for the platform.

`extract_bc`: Pointer to a function that extracts the barcode from a `FASTQRecord` object. Probably the easiest thing to do in your case would be to format your FASTQs the way EMA expects (with `:<barcode sequence>` after the FASTQ identifier, like `@read1:ACGTACGT`), then just use the 10x barcode parsing function, `extract_bc_10x`.
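If it helps, that reformatting step could be sketched roughly like this in C. This is my own illustration, not EMA code; `BC_LEN` and the exact barcode placement are assumptions you'd need to adjust for stLFR:

```c
/* Sketch: move a barcode from the 3' end of a read into the FASTQ
 * identifier, producing the `@<id>:<barcode>` form that extract_bc_10x
 * can parse. BC_LEN is a placeholder for the real stLFR barcode length. */
#include <stdio.h>
#include <string.h>

#define BC_LEN 10  /* assumption: barcode occupies the last BC_LEN bases */

/* Writes "@<id>:<barcode>" into out and trims the barcode off seq.
 * Returns 0 on success, -1 if the read is shorter than the barcode. */
int move_barcode_to_header(const char *id, char *seq, char *out, size_t outlen)
{
    size_t n = strlen(seq);
    if (n < BC_LEN)
        return -1;
    snprintf(out, outlen, "@%s:%s", id, seq + n - BC_LEN);
    seq[n - BC_LEN] = '\0';  /* remove the barcode bases from the sequence */
    return 0;
}
```

Remember to trim the corresponding bases off the quality string as well, so sequence and quality stay the same length.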
`many_clouds`: Some technologies (e.g. Moleculo) can have many reads per barcode, which necessitates slight changes to the algorithm. It looks like your technology doesn't have too many reads per barcode, though, so you can just set this to `0`.
`dist_thresh`: Distance threshold to use when grouping alignments into clouds. If your fragment lengths are similar to 10x's, you can probably use the same value of 50k; otherwise, scaling this proportionally would probably be fine.

`error_rate`: Per-nucleotide error rate of the sequencer (e.g. 10x has a 0.1% error rate, so we set this to `0.001`).
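To make the role of `dist_thresh` concrete, here's a toy illustration (not EMA's actual clustering code) of how a distance threshold partitions sorted alignment positions into clouds:

```c
/* Toy illustration of dist_thresh: consecutive sorted alignment
 * positions within dist_thresh of each other fall into the same cloud;
 * a larger gap starts a new cloud. */
#include <stddef.h>

/* Returns the number of clouds formed from n sorted positions. */
size_t count_clouds(const long *pos, size_t n, long dist_thresh)
{
    if (n == 0)
        return 0;
    size_t clouds = 1;
    for (size_t i = 1; i < n; i++)
        if (pos[i] - pos[i - 1] > dist_thresh)
            clouds++;  /* gap exceeds the threshold: start a new cloud */
    return clouds;
}
```

With a 50k threshold, reads 10k apart stay in one cloud while reads 160k apart split into two, which is why the threshold should scale with typical fragment length.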
`n_density_probs`/`density_probs`: This encodes the probability of seeing a particular number of reads in a 1 kb window within a fragment. For example, for 10x we have `density_probs = [0.6, 0.05, 0.2, 0.01]` (and therefore `n_density_probs`, the length of `density_probs`, is 4); this means there's a 60% chance of seeing zero reads in a 1 kb window, a 5% chance of seeing one read, a 20% chance of seeing two reads (i.e. one read pair), and a 1% chance of seeing three reads. Probabilities for higher read counts are automatically scaled down exponentially, which is why these don't sum to 1. If you don't plan to use EMA's read-density optimization feature, you can ignore all this. Otherwise, the best way to determine these probabilities is to do a regular alignment, look at uniquely-mapping fragments, and build a histogram of read counts per 1 kb window.

Will EMA support mapping reads from stLFR? Is this solved?
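That histogram step could be sketched as follows. This is my own helper, not part of EMA; it just turns per-window read counts (tallied from uniquely-mapping fragments of a regular alignment) into the first few `density_probs` entries:

```c
/* Sketch: estimate density_probs from per-1kb-window read counts.
 * probs[k] becomes the fraction of windows containing exactly k reads;
 * windows with counts >= N_DENSITY_PROBS are ignored, matching the
 * idea that higher counts are extrapolated rather than tabulated. */
#include <stddef.h>

#define N_DENSITY_PROBS 4  /* same length as the 10x profile */

void estimate_density_probs(const int *counts, size_t n,
                            double probs[N_DENSITY_PROBS])
{
    size_t hist[N_DENSITY_PROBS] = {0};
    for (size_t i = 0; i < n; i++)
        if (counts[i] >= 0 && counts[i] < N_DENSITY_PROBS)
            hist[counts[i]]++;
    for (size_t k = 0; k < N_DENSITY_PROBS; k++)
        probs[k] = n ? (double)hist[k] / (double)n : 0.0;
}
```

For instance, if 60% of windows are empty and 20% hold a read pair, this reproduces probabilities in the same shape as the 10x values above.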
Will EMA support mapping reads from stLFR (http://dx.doi.org/10.1101/324392), where the uncorrected barcode is at the 3'-end of read 2?