Generade-nl / EelSeeds

Scripts used to extract seeds for the European eel genome assembly

Selecting seeds for TULIP from only PacBio reads #1

Open mictadlo opened 6 years ago

mictadlo commented 6 years ago

Hi, do you have any instructions on how to select seeds for TULIP from only PacBio reads?

Thank you in advance.

Michal

Generade-nl commented 6 years ago

Hey Michal,

There are multiple strategies you could follow. For one, you could take seeds from a reference sequence, if you have one. If you don't have a high-quality reference at hand, you could use the beginning and the end of reads as seeds. When using this method, make sure the seeds do not contain adapter-like sequence: adapters are sometimes not trimmed perfectly and remain in the reads, so seeds taken from the very beginning and end of reads may end up being adapter sequence, which you don't want. You might therefore want to skip X bases before taking a seed at the start of a read, and similarly leave X bases unused at the end.
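As a minimal sketch of the read-end approach (not the EelSeeds scripts themselves), the function below takes one seed from each end of a read while skipping a margin of bases. The name `end_seeds` and the default values for `seed_len` and `skip` are hypothetical and only for illustration; choose values that fit your data and your TULIP setup.

```python
def end_seeds(read_seq, seed_len=1000, skip=100):
    """Return (start_seed, end_seed) taken from both ends of a read.

    `skip` bases are left unused at each end to avoid leftover adapter
    sequence. `seed_len` and `skip` are illustrative defaults, not
    values recommended by TULIP.
    Returns None if the read is too short for two non-overlapping seeds.
    """
    n = len(read_seq)
    if n < 2 * (seed_len + skip):
        return None
    start_seed = read_seq[skip:skip + seed_len]
    end_seed = read_seq[n - skip - seed_len:n - skip]
    return start_seed, end_seed
```

Reads that are too short to give two separate seeds are simply skipped here; you may prefer to take a single seed from them instead.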

Another thing you might want to take into consideration is how PacBio generates reads: it takes double-stranded DNA and makes the molecule circular by ligating adapter sequences to both ends, and the circular molecule is then read out multiple times. Hence the dataset can contain multiple copies of the same molecule; you can verify this using the headers in your FASTA file. Here is a simple example:

m54072_170926_110145/4194524/3729_8028
m54072_170926_110145/4194524/8106_12470

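These are the headers of two reads that come from the same ZMW (4194524, the second field in the header, with fields separated by '/'). The final field gives the start and end coordinates of the read within the full ZMW read, so the difference between the two is the number of nucleotides recorded.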
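To check this on your own data, a small sketch like the following could split each header into its fields and group reads by ZMW. It assumes the header layout `movie/zmw/start_end` seen in the example above, with the last field being a coordinate pair; `parse_subread_header` is a hypothetical helper name, not part of any PacBio or TULIP tool.

```python
from collections import defaultdict


def parse_subread_header(header):
    """Split a 'movie/zmw/start_end' header into (movie, zmw, start, end).

    Assumes the three-field layout shown in the example headers; a
    leading '>' and anything after the first whitespace are ignored.
    """
    name = header.lstrip(">").split()[0]
    movie, zmw, span = name.split("/")
    start, end = (int(x) for x in span.split("_"))
    return movie, int(zmw), start, end


headers = [
    "m54072_170926_110145/4194524/3729_8028",
    "m54072_170926_110145/4194524/8106_12470",
]

# Collect read lengths (end - start) for every movie/ZMW pair.
lengths_by_zmw = defaultdict(list)
for h in headers:
    movie, zmw, start, end = parse_subread_header(h)
    lengths_by_zmw[(movie, zmw)].append(end - start)
```

Any key in `lengths_by_zmw` with more than one entry is a ZMW that contributed several reads of the same molecule.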

When you align seeds to reads that actually represent the exact same molecule, you might overcomplicate the assembly graph, which leads to less contiguity and more fragmentation. We therefore think it is best to keep a single read from every ZMW. On a general note: try to select the longest and best-quality reads to get the best assembly.
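One possible way to keep a single read per ZMW is sketched below. It keeps the longest read for each movie/ZMW pair, which is an assumption on my side: length is used as a simple stand-in for "best", and read quality is not considered. `longest_read_per_zmw` is a hypothetical name, and the input is assumed to be (header, sequence) pairs that you have already read from your FASTA file.

```python
def longest_read_per_zmw(records):
    """Keep the longest (header, sequence) record for each movie/ZMW pair.

    Headers are assumed to start with 'movie/zmw/...'. Ties are resolved
    in favour of the record seen first.
    """
    best = {}
    for header, seq in records:
        movie, zmw = header.lstrip(">").split()[0].split("/")[:2]
        key = (movie, zmw)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())
```

If you have per-read quality values available, you could replace the length comparison with a combined length and quality criterion.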

Hope this helps you solve your assembly problem,

Cheers, Michael