Simulate test sets using Arabidopsis genome and chloroplast sequence

chloroExtractorTeam / chloroplast_landscape

Chloroplast landscape for different plant species

MIT License


Simulate test sets using Arabidopsis genome and chloroplast sequence #6

Open greatfireball opened 6 years ago

greatfireball commented 6 years ago

Generate test sets to evaluate assembler performance.

To do so, use the Arabidopsis genome and chloroplast sequences from GenBank and simulate short-read libraries with the following characteristics:

Read length (100, 150, 250 bp)
Insert size (overlapping by 50 %, overlapping by 10 %, 100, 200, 500 bp)
Different ratios of genomic DNA to chloroplast DNA (500:1, 200:1, 100:1, 50:1, 10:1, 1:1, 1:10, 1:50, 1:100, 1:200, 1:500)
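The combinations above could be enumerated with a small script. This is only a sketch under two assumptions that the list does not settle: the overlap percentages are taken relative to the read length, and the fixed values (100, 200, 500 bp) are read as the gap between the two mates rather than the full fragment length. The names `fragment_length` and `build_grid` are made up for illustration.

```python
from itertools import product

READ_LENGTHS = [100, 150, 250]
# Two overlapping layouts (percent of read length) and three fixed values (bp).
INSERT_SPECS = [("overlap_pct", 50), ("overlap_pct", 10),
                ("gap_bp", 100), ("gap_bp", 200), ("gap_bp", 500)]
# (genome, chloroplast) ratios as listed in the issue.
RATIOS = [(500, 1), (200, 1), (100, 1), (50, 1), (10, 1), (1, 1),
          (1, 10), (1, 50), (1, 100), (1, 200), (1, 500)]


def fragment_length(read_length, spec):
    """Return the fragment length implied by a read length and an insert spec."""
    kind, value = spec
    if kind == "overlap_pct":
        # ASSUMPTION: mates overlap by `value` percent of one read length.
        return 2 * read_length - read_length * value // 100
    # ASSUMPTION: fixed values are the unsequenced gap between the mates.
    return 2 * read_length + value


def build_grid():
    """Enumerate every read length / insert / ratio combination."""
    return [
        {
            "read_length": read_length,
            "insert": spec,
            "fragment_length": fragment_length(read_length, spec),
            "genome_to_chloroplast": ratio,
        }
        for read_length, spec, ratio in product(READ_LENGTHS, INSERT_SPECS, RATIOS)
    ]


grid = build_grid()
print(len(grid), "parameter combinations")
```

If the fixed values are meant as full fragment lengths instead, only the last line of `fragment_length` would need to change.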

Take care of the circular sequence of the chloroplast genome!
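One common way to handle the circularity, sketched here as an assumption rather than a decision made in this thread, is to append the start of the chloroplast sequence to its end before simulation, so that fragments can span the artificial break point. The helper name `linearize_circular` is hypothetical.

```python
def linearize_circular(sequence, fragment_length):
    """Extend a circular sequence so linear fragments can cross the origin.

    Appending the first (fragment_length - 1) bases gives exactly one
    fragment start per position of the circle for fragments of that length.
    """
    pad = min(fragment_length - 1, len(sequence))
    return sequence + sequence[:pad]


# Toy example: a 6 bp "circle" and 3 bp fragments.
extended = linearize_circular("ACGTAC", 3)
windows = [extended[i:i + 3] for i in range(len(extended) - 3 + 1)]
print(extended, windows)
```

With a simulator that draws fragment lengths from a distribution, the padding would have to be chosen from the largest expected fragment, and coverage near the origin is then only approximately uniform.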

Use simulation software that accepts a random seed to ensure reproducibility. Maybe this paper gives some ideas about which tool to use.
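As an illustration of the seed requirement, here is a sketch that builds a paired-end ART call with a fixed seed. It assumes that the installed `art_illumina` accepts `-rs` (`--rndSeed`), which should be checked against that version's help output; the helper `art_command` and the seed value 42 are made up for this example, and nothing is executed.

```python
import shlex


def art_command(reference, prefix, read_length, coverage,
                mean_fragment, fragment_sd, seed):
    """Build (but do not run) a paired-end art_illumina command line."""
    return [
        "art_illumina", "-p",
        "-i", reference,           # input reference FASTA
        "-l", str(read_length),    # read length
        "-f", str(coverage),       # fold coverage
        "-o", prefix,              # output file prefix
        "-m", str(mean_fragment),  # mean fragment length
        "-s", str(fragment_sd),    # fragment length standard deviation
        "-rs", str(seed),          # ASSUMPTION: seed flag of the installed ART version
    ]


cmd = art_command("sequence-arabidopsis-thaliana-kern-chl.fa",
                  "a_thaliana_1_1_sim", 150, 16, 500, 150, seed=42)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to log the exact call next to each simulated data set and to pass it to `subprocess.run` later.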

PfaffS commented 6 years ago

Available Data from F1:

Simulating data with ART: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/

Command: art_illumina [options] -i -l -f -o -m -s

   1:1 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl.fa -l 150 -f 16 -o a_thaliana_1_1_sim -m 500 -s 150

(only 16x coverage because the chloroplast and A. thaliana reads were used 6x)

   1:10 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu10.fa -l 150 -f 100 -o a_thaliana_1_10_sim -m 500 -s 150

   1:100 :  ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu100.fa -l 150 -f 100 -o a_thaliana_1_100_sim -m 500 -s 150

   1:1000 :  ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu1000.fa -l 150 -f 100 -o a_thaliana_1_1000_sim -m 500 -s 150

  chl-only: ../program/art_bin_MountRainier/art_illumina -p -i sequence-arabidopsis-thaliana-chl.txt -l 150 -f 100 -o a_thaliana_chl_only_sim -m 500 -s 150
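The input file names (e.g. `...-kern-chl-1zu10.fa`) suggest that each ratio was realised by putting the nuclear sequences and several copies of the chloroplast sequence into one FASTA file. How the files were actually built is not shown here, so the following is only a sketch of that idea; `mix_records` and `to_fasta` are hypothetical helpers working on plain `(name, sequence)` tuples.

```python
def mix_records(nuclear, chloroplast, nuclear_copies=1, chloroplast_copies=1):
    """Return FASTA records holding the requested number of copies of each part."""
    mixed = []
    for copy in range(1, nuclear_copies + 1):
        mixed += [(f"{name}_copy{copy}", seq) for name, seq in nuclear]
    for copy in range(1, chloroplast_copies + 1):
        mixed += [(f"{name}_copy{copy}", seq) for name, seq in chloroplast]
    return mixed


def to_fasta(records, width=60):
    """Format (name, sequence) tuples as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Toy sequences standing in for the real nuclear and chloroplast records.
nuclear = [("chr1", "ACGTACGTAC"), ("chr2", "GGGTTTAAAC")]
chloroplast = [("chl", "TTGACA")]
mixed = mix_records(nuclear, chloroplast, nuclear_copies=1, chloroplast_copies=10)
print(to_fasta(mixed, width=5))
```

If the simulator applies the fold coverage to every record, a 1:10 file simulated at 100x would give 100x nuclear and 1000x chloroplast coverage in total.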

greatfireball commented 6 years ago

I think we should try chloroplasts as contamination as well. I would suggest 10:1, 100:1, and 1000:1 genome vs. chloroplast. This setting might simulate extracted nuclei with a little bit of contamination.

Opinions @PfaffS @iimog ?

iimog commented 6 years ago

I don't hate the idea. However, one thing to consider is that if we target 200x chloroplast coverage, the last dataset (1000:1) would require a genomic coverage of 200,000x. I don't think that is feasible or realistic. Even if we want to attempt assembly at 20x chloroplast coverage, I can't (currently) imagine a genome sequenced to 20,000x coverage.
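The arithmetic behind these numbers, written out as a small sketch. It uses the copy-number reading of the ratio (genome copies to chloroplast copies); the function names are made up for illustration.

```python
def required_genome_coverage(chloroplast_coverage, genome_copies, chloroplast_copies):
    """Genome coverage needed to reach a target chloroplast coverage."""
    return chloroplast_coverage * genome_copies / chloroplast_copies


def resulting_chloroplast_coverage(genome_coverage, genome_copies, chloroplast_copies):
    """Chloroplast coverage obtained from a given genome coverage."""
    return genome_coverage * chloroplast_copies / genome_copies


# 1000:1 genome vs. chloroplast at two chloroplast coverage targets.
for target in (200, 20):
    print(target, required_genome_coverage(target, 1000, 1))
```

The second helper answers the reverse question, e.g. what chloroplast coverage a given genome coverage yields at a milder ratio such as 10:1.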

greatfireball commented 6 years ago

100% agree, but I would like to know what will happen if we only provide rare chloroplast sequences. Wrong assemblies? Error messages? Anything else?

Nevertheless, we define a ratio of 1:1 as one complete host genome to one complete chloroplast genome. Another definition is also possible: 1:1 could mean that for every read from the host genome there is one read from the chloroplast genome. (I just wanted to state that here to ensure that we remember our definition later.) :)
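To make the difference between the two definitions concrete, here is a sketch that converts a copy ratio into the expected fraction of chloroplast reads, assuming reads are sampled in proportion to sequence length. The sizes used below (about 119 Mb nuclear, about 154 kb chloroplast) are approximate assumptions for illustration and should be replaced by the lengths of the sequences actually used.

```python
from fractions import Fraction


def chloroplast_read_fraction(genome_copies, chloroplast_copies,
                              genome_length, chloroplast_length):
    """Expected share of chloroplast reads for a given copy ratio."""
    chloroplast_bases = chloroplast_copies * chloroplast_length
    genome_bases = genome_copies * genome_length
    return Fraction(chloroplast_bases, genome_bases + chloroplast_bases)


def copies_for_equal_reads(genome_length, chloroplast_length):
    """Chloroplast copies per genome copy at which both contribute equal reads."""
    return Fraction(genome_length, chloroplast_length)


# ASSUMPTION: rough sizes in bp, not taken from this thread.
GENOME_LENGTH = 119_000_000
CHLOROPLAST_LENGTH = 154_000

share = chloroplast_read_fraction(1, 1, GENOME_LENGTH, CHLOROPLAST_LENGTH)
print(f"1:1 by copies -> {float(share):.4%} chloroplast reads")
print(f"1:1 by reads  -> about {float(copies_for_equal_reads(GENOME_LENGTH, CHLOROPLAST_LENGTH)):.0f} chloroplast copies per genome copy")
```

So under the copy-number definition, a 1:1 data set contains only a small share of chloroplast reads, whereas a 1:1 read ratio would correspond to several hundred chloroplast copies per genome copy.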

iimog commented 6 years ago

Yeah, good point. I'd suggest we first try it with 10:1 then. We could use a genome covered 500x (so the chloroplast will be covered 50x). With default parameters, I expect ChloroExtractor to fail when it tries to scale reads to 200x coverage. We can then re-run ChloroExtractor with a target coverage of 40x to see what happens. I'm also curious how the other tools behave.