PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Unable to decrease the Identity option ( -e ) below .70 in pa_HPCdaligner_option. #52

Closed manaburn closed 9 years ago

manaburn commented 9 years ago

Hello,

We have sequenced a rice genome to 80x coverage with PacBio reads, but the quality of the data is relatively low (lower alignment identity than the E. coli test data, the fruit fly data, etc.).

I then tried to assemble the genome using the parameters from the E. coli test, but the assembly is much smaller than the expected genome size (290 Mb out of 370 Mb) and the N50 is only 100 kb.

I suspect the identity among reads is much lower (80% identity to the rice reference, <70% read-to-read), so I tried to decrease the identity option in pa_HPCdaligner_option to .60 or .50. The program returns the error message "HPCdaligner: Average correlation must be in [.7,1.) (0.5)". Could you please allow more freedom for this option in the code? Also, do you have any experience or suggestions for dealing with such low-quality data?

Regards

Bin

Below are the rice raw reads: ricerawreads

Below are the rice corrected reads using FALCON: ricecorrectedreads

Below are the E. coli raw reads: ecolirawreads

Below are the E. coli corrected reads using FALCON: ecolicorrectedreads

pb-jchin commented 9 years ago

@manaburn thanks for the detailed information. Do you have the configuration file you used? I don't think the issue is the identity. Your raw accuracy does have a different histogram from the E. coli one. (I suspect some overload situation, but I don't want to draw any conclusion without the right information.) However, the histogram of the p-read lengths seems to indicate that you might not be getting enough overlapping reads. This can happen because (1) the overlapper filtered overlaps out due to repeats, or (2) the cap on the number of reads used for error correction is too low.

For (1), you can use a smaller block size with the same -t to increase the sensitivity. For (2), you can increase the number of reads used for error correction. Again, one might have to look at the overlap statistics to get a sense of what is going on.

pb-jchin commented 9 years ago

OK, I think a different explanation for the lower alignment identity is probably strain differences.

manaburn commented 9 years ago

@pb-jchin Here is my configuration file:

[General]
input_fofn = input.fofn
input_type = raw

length_cutoff = 1000
length_cutoff_pr = 1000

pa_concurrent_jobs = 300
ovlp_concurrent_jobs = 300

pa_HPCdaligner_option = -v -dal4 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal4 -t32 -h60 -e.96 -l500 -s1000

pa_DBsplit_option = -x500 -s100
ovlp_DBsplit_option = -x500 -s100

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 3 --local_match_count_threshold 2 --max_n_read 200 --n_core 4

overlap_filtering_setting = --max_diff 30 --max_cov 30 --min_cov 3 --bestn 10

I have read the previous issues and found that the -s100 option in pa_DBsplit_option sets the block size. Decreasing it gives smaller blocks and more sensitivity. As for the cap on the reads used for error correction, is it -t in pa_HPCdaligner_option that controls it? In the DALIGNER manual, -t is used to suppress over-represented k-mers.

In overlap_filtering_setting, should --max_diff and --max_cov be set based on the raw data (80x) or the corrected data (27x)?

Anyway, which command is typically used to get overlap statistics? I found that LAshow lists every overlap in the las files. Do I need to write a small script to gather these overlap results and compute statistics on them?
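The two adjustments suggested earlier in the thread (a smaller block size, and a higher cap on the reads used for error correction) can be sketched as edits to the option strings from the configuration above. This is only an illustration: `set_flag` is a hypothetical helper, the new values 50 and 400 are arbitrary examples rather than recommendations, and it is an assumption (not confirmed in this thread) that the cap in question is `--max_n_read` in falcon_sense_option.

```python
import shlex


def set_flag(options: str, flag: str, value: str) -> str:
    """Return `options` with `flag` set to `value` (hypothetical helper).

    Short flags are assumed to carry their value attached (-s100);
    long flags are assumed to take it as the next token (--max_n_read 200).
    """
    tokens = shlex.split(options)
    out = []
    i = 0
    replaced = False
    while i < len(tokens):
        tok = tokens[i]
        if flag.startswith("--") and tok == flag:
            out.extend([flag, value])
            i += 2  # skip the flag and its old value
            replaced = True
        elif (not flag.startswith("--")
              and tok.startswith(flag)
              and not tok.startswith("--")):
            out.append(flag + value)
            i += 1
            replaced = True
        else:
            out.append(tok)
            i += 1
    if not replaced:
        raise KeyError(f"{flag} not found in: {options}")
    return " ".join(out)


# Option strings copied from the configuration in this thread.
pa_dbsplit = "-x500 -s100"
falcon_sense = ("--output_multi --min_idt 0.70 --min_cov 3 "
                "--local_match_count_threshold 2 --max_n_read 200 --n_core 4")

# Example values only: a smaller block size and a higher read cap.
print("pa_DBsplit_option =", set_flag(pa_dbsplit, "-s", "50"))
print("falcon_sense_option =", set_flag(falcon_sense, "--max_n_read", "400"))
```

The -t value in pa_HPCdaligner_option is left untouched here, in line with the advice to keep the same -t while shrinking the block.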
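A small script along these lines might look like the sketch below. It counts overlaps per read from text captured from LAshow. The input layout is an assumption: each overlap line is taken to start with two integer read indices (A read, then B read), and any line that does not is skipped. The sample lines are made up for illustration and are not real output; the function names are hypothetical.

```python
import re
import statistics
from collections import Counter

# Assumed layout: an overlap line begins with two integers (A read, B read),
# possibly written with thousands separators. Other lines are ignored.
_LEADING_IDS = re.compile(r"^\s*(\d[\d,]*)\s+(\d[\d,]*)\s")


def overlaps_per_read(lines):
    """Count how many overlap lines each A read appears in."""
    counts = Counter()
    for line in lines:
        match = _LEADING_IDS.match(line)
        if match is None:
            continue  # header, blank line, or unrecognized layout
        a_read = int(match.group(1).replace(",", ""))
        counts[a_read] += 1
    return counts


def summarize(counts):
    """Return a few summary statistics over the per-read overlap counts."""
    values = sorted(counts.values())
    return {
        "reads_with_overlaps": len(values),
        "total_overlaps": sum(values),
        "median_per_read": statistics.median(values) if values else 0,
        "max_per_read": values[-1] if values else 0,
    }


# Made-up sample lines in the assumed layout.
sample = [
    "example header line",
    "     1      2 n   [     0.. 5,000] x [   100.. 5,100]",
    "     1      3 c   [ 1,000.. 6,000] x [     0.. 5,000]",
    "     2      3 n   [     0.. 4,000] x [   500.. 4,500]",
]
counts = overlaps_per_read(sample)
print(summarize(counts))
```

For real data, the same functions could be fed `sys.stdin` with the LAshow output piped in. A large share of reads with very few overlaps would be consistent with the "not enough reads overlapped" explanation given above.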

Regards

Bin

pb-cdunn commented 9 years ago

We are hoping to have more automated selection of DALIGNER options someday. But until then, you might have to tweak them. Anyway, I think Jason is saying that because your corrected rice reads are so short, you probably have multiple strains in your samples. I'm closing this, but feel free to re-open if you find more information which could help others.