Open greatfireball opened 6 years ago
Available Data from F1:
Simulating Data with ART: http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
Command:
art_illumina [options] -i
1:1 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl.fa -l 150 -f 16 -o a_thaliana_1_1_sim -m 500 -s 150
(only 16x coverage, because the reads of the chloroplast and A. thaliana were used 6x)
1:10 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu10.fa -l 150 -f 100 -o a_thaliana_1_10_sim -m 500 -s 150
1:100 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu100.fa -l 150 -f 100 -o a_thaliana_1_100_sim -m 500 -s 150
1:1000 : ./art_illumina -p -i sequence-arabidopsis-thaliana-kern-chl-1zu1000.fa -l 150 -f 100 -o a_thaliana_1_1000_sim -m 500 -s 150
chl-only: ../program/art_bin_MountRainier/art_illumina -p -i sequence-arabidopsis-thaliana-chl.txt -l 150 -f 100 -o a_thaliana_chl_only_sim -m 500 -s 150
I think we should try chloroplasts as contamination as well. I would suggest genome-to-chloroplast ratios of 10:1, 100:1 and 1000:1. This setting might simulate extracted nuclei with a little bit of chloroplast contamination.
Opinions @PfaffS @iimog ?
I don't hate the idea. However, one thing to consider is that if we target 200x chloroplast coverage, the last dataset would require a genomic coverage of 200,000x. I don't think that is feasible or realistic. Even if we want to attempt assembly at 20x chloroplast coverage, I can't (currently) imagine a genome sequenced to 20,000x coverage.
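For reference, the arithmetic behind those numbers can be sketched as below. This is only an illustration: it assumes the ratio is a copy-number ratio (complete genomes per complete chloroplast), so that coverage scales by the same factor. The function name is made up for this example.

```python
def required_genome_coverage(chloroplast_coverage, genome_copies, chloroplast_copies=1):
    """Genome coverage needed to reach a given chloroplast coverage.

    Assumes a copy-number ratio (genome_copies : chloroplast_copies),
    so the coverage of each sequence is proportional to its copy number.
    """
    if chloroplast_copies <= 0:
        raise ValueError("chloroplast_copies must be positive")
    return chloroplast_coverage * genome_copies / chloroplast_copies


# The 1000:1 dataset at the two chloroplast coverages discussed above.
for target in (200, 20):
    print(target, required_genome_coverage(target, genome_copies=1000))
```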
100% agree, but I would like to know what will happen if we only provide rare chloroplast sequences. Wrong assemblies? Error messages? Anything else?
Nevertheless, note that we define a 1:1 ratio as one complete host genome per one complete chloroplast genome. Another definition is also possible: 1:1 would then mean that for every read from the host genome there is one read from the chloroplast genome. (I just wanted to state that here to ensure that we remember our definition later) :)
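To make the difference between the two definitions concrete, here is a rough sketch that converts a copy-number ratio into the expected fraction of chloroplast reads. It assumes equal read lengths and uniform sampling of both sequences. The sequence sizes are approximate values I am assuming for A. thaliana (about 119 Mbp nuclear, about 154 kbp chloroplast) and should be replaced by the lengths of the actual FASTA files.

```python
# Approximate sizes, assumed for illustration only.
GENOME_SIZE = 119_000_000    # nuclear genome, bp (approximate)
CHLOROPLAST_SIZE = 154_478   # chloroplast genome, bp (approximate)


def chloroplast_read_fraction(genome_copies, chloroplast_copies,
                              genome_size=GENOME_SIZE,
                              chloroplast_size=CHLOROPLAST_SIZE):
    """Expected fraction of reads that come from the chloroplast.

    With equal read lengths and uniform sampling, the number of reads
    from a sequence is proportional to copies * sequence length.
    """
    genome_bases = genome_copies * genome_size
    chloroplast_bases = chloroplast_copies * chloroplast_size
    return chloroplast_bases / (genome_bases + chloroplast_bases)


# Copy-number ratio 1:1 is far from a 1:1 read ratio.
print(f"{chloroplast_read_fraction(1, 1):.4%} of reads are chloroplast")
```

Under the copy-number definition, a 1:1 dataset contains only a small fraction of chloroplast reads, whereas the read-based definition would put that fraction at 50%.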
Yeah, good point. I'd suggest we first try it with 10:1 then. We could use a genome covered at 500x (so the chloroplast will be covered at 50x). With default parameters I expect ChloroExtractor to fail when it tries to scale reads to 200x coverage. We can then re-run ChloroExtractor with a target coverage of 40x to see what happens. I'm also curious how the other tools behave.
Generate test sets to evaluate assembler performance.
To do so, use the Arabidopsis genome and chloroplast from GenBank and simulate short-read libraries fulfilling the following characteristics:
Take care of the circular sequence of the chloroplast genome!
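One common workaround for the circular sequence, sketched here as a possibility and not as the approach actually used for the datasets above, is to append the first bases of the chloroplast to its end before simulation, so that fragments can span the origin. The function name and the choice of overlap length are assumptions. Something on the order of the fragment length (for example the 500 bp passed via `-m`) seems a reasonable starting point.

```python
def extend_circular(sequence, overlap):
    """Return the sequence with its first `overlap` bases appended to the end.

    A read simulator that treats the input as linear can then draw
    fragments that cross the original start/end junction.
    """
    if overlap < 0:
        raise ValueError("overlap must be non-negative")
    # Never append more than one full copy of the sequence.
    overlap = min(overlap, len(sequence))
    return sequence + sequence[:overlap]


# Toy example: the junction "AC|AC" now exists in the linear sequence.
print(extend_circular("ACGTAC", 2))
```

Reads simulated from the extended sequence still map to the original circular genome, but coverage close to the origin should be checked, since that region is now present twice in the simulator input.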
Use simulation software that allows setting a random seed to ensure reproducibility. Maybe this paper gives some ideas about which tool to use.
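As a sketch of how a fixed seed could be added to the ART calls used in this thread: to my knowledge `art_illumina` accepts `-rs` (`--rndSeed`) for this, but please verify against the help output of the installed version. The helper name, the seed value 42 and the default parameter values (copied from the commands in this issue) are assumptions for illustration.

```python
def art_command(reference, out_prefix, coverage, seed,
                read_length=150, mean_fragment=500, fragment_sd=150,
                binary="art_illumina"):
    """Build an art_illumina paired-end command line with a fixed random seed.

    The -rs flag is assumed to be ART's random seed option; check the
    tool's help text before relying on it.
    """
    return [
        binary, "-p",
        "-i", reference,
        "-l", str(read_length),
        "-f", str(coverage),
        "-o", out_prefix,
        "-m", str(mean_fragment),
        "-s", str(fragment_sd),
        "-rs", str(seed),
    ]


cmd = art_command("sequence-arabidopsis-thaliana-chl.txt",
                  "a_thaliana_chl_only_sim", coverage=100, seed=42)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Recording the seed next to each dataset name should be enough to regenerate the same reads later.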