Generade-nl / EelSeeds

Scripts used to extract seeds for the European eel genome assembly

choice of amount of Illumina data #3

Open dcopetti opened 4 years ago

dcopetti commented 4 years ago

Hello,

I am preparing the short-read data with these two scripts, and I wonder whether you have guidelines for choosing the parameters. I am assembling a 5 Gb genome and have about 30x coverage of 1x260 bp reads (I took only one end of each pair so that I do not need to merge them; if it is better, I can merge the pairs to get ~460 bp FLASH-merged reads). I got my repeat threshold from the Jellyfish histogram and chose 58 (k-mer size 25). The high peak is the heterozygous one; the lower peak at ~60x is the homozygous one. [Image: Capture]
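For reference, the peak positions described above could be read off the histogram with something like the following. This is only a minimal sketch: it assumes the two-column text output of `jellyfish histo` (multiplicity, then number of distinct k-mers), uses made-up toy numbers rather than the real data, and the helper names `parse_histo` and `find_peaks` are hypothetical. A real histogram is noisy and would probably need smoothing before local maxima mean anything, and the sketch does not say how the repeat threshold itself should be chosen.

```python
# Toy histogram in "multiplicity count" form (invented numbers, not real data).
TOY_HISTO = """\
1 1000
2 300
3 80
4 120
5 260
6 150
7 90
8 110
9 160
10 220
11 170
12 100
13 40
"""


def parse_histo(text):
    """Parse 'multiplicity count' lines into a list of (multiplicity, count) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            rows.append((int(parts[0]), int(parts[1])))
    return rows


def find_peaks(rows, min_multiplicity=5):
    """Return multiplicities that are strict local maxima.

    Multiplicities below min_multiplicity are skipped (assumption: they are
    dominated by sequencing-error k-mers).
    """
    peaks = []
    for i in range(1, len(rows) - 1):
        mult, count = rows[i]
        if mult < min_multiplicity:
            continue
        if count > rows[i - 1][1] and count > rows[i + 1][1]:
            peaks.append(mult)
    return peaks


peaks = find_peaks(parse_histo(TOY_HISTO), min_multiplicity=3)
print(peaks)  # prints [5, 10]
```

In the toy data the first peak sits at half the multiplicity of the second, which is the pattern expected for a heterozygous and a homozygous peak.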

I edited binLongEelReads.perl to extract sequences in the 230-245 bp range: should I stay as high as I can? How important is this length value? Is it worth using 460 bp reads? With longer reads, though, more may contain high-copy k-mers. I am also wondering whether I should parse all 30x of the data to get the sequences to align to my long reads, or whether there is a threshold at which I can stop. Thanks,
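To see how many reads each candidate length range would yield before settling on one, a tally along these lines may help. It is a rough sketch and not the logic of binLongEelReads.perl, whose code is not shown here: it assumes reads are grouped into 5 bp bins labeled by their lower bound, and the names `read_fasta` and `bin_by_length` are hypothetical.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs; sequences may span several lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def bin_by_length(handle, low, high, step=5):
    """Count reads per length bin.

    Assumption: bins start at low, low+step, ..., high, and each bin covers
    lengths [start, start + step). Reads outside all bins are ignored.
    """
    counts = Counter()
    for _, seq in read_fasta(handle):
        n = len(seq)
        if low <= n < high + step:
            counts[low + (n - low) // step * step] += 1
    return counts


# Toy input: six reads of lengths 229, 230, 233, 236, 249 and 250.
toy = "".join(
    f">read{i}\n{'A' * n}\n" for i, n in enumerate([229, 230, 233, 236, 249, 250])
)
counts = bin_by_length(io.StringIO(toy), low=230, high=245)
print(sorted(counts.items()))  # prints [(230, 2), (235, 1), (245, 1)]
```

Running the same tally over a subsample of the reads would give a quick estimate of the yield per bin without parsing the full 30x.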

Dario

dcopetti commented 4 years ago

Hello, I went ahead and ran the steps, and the second Perl script gave me this output:

$ perl ./binLongEelReads_250-265.perl merged_reads_200_58.fa
length 250 10606826
length 255 10595691
length 260 10581447
length 265 7024869
Warning: unable to close filehandle properly: Bad file descriptor during global destruction.
$ perl -v
This is perl 5, version 22, subversion 0 (v5.22.0) built for x86_64-linux-thread-multi

I looked at the tail of the FASTA files, and the formatting looks fine. I wonder whether I can disregard the warning or whether the files may be incomplete. Thanks, Dario
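One way to check beyond inspecting the tail would be to count the records in each output file and compare the totals with the numbers the script printed (for example 10606826 for length 250). The sketch below is an assumption-laden illustration: the helper name `count_fasta_records` is hypothetical, the demo runs on a throwaway file because the real output file names are not given here, and a clean ending is only a heuristic, not proof that nothing was lost.

```python
import os
import tempfile


def count_fasta_records(path):
    """Return (number of '>' header lines, whether the file ends cleanly).

    'Ends cleanly' here means the last line is a sequence line terminated by
    a newline; a file cut off mid-write often fails this check.
    """
    records = 0
    last_line = ""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                records += 1
            last_line = line
    ends_cleanly = last_line.endswith("\n") and not last_line.startswith(">")
    return records, ends_cleanly


# Demo on a tiny temporary file standing in for one of the binned FASTA files.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">r1\nACGT\n>r2\nGGCC\n")
    demo_path = tmp.name

records, ends_cleanly = count_fasta_records(demo_path)
os.remove(demo_path)
print(records, ends_cleanly)  # prints 2 True
```

If the record count in a file matches the count reported for that length and the file ends cleanly, the file is less likely to have been truncated.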