dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

very low n. of cluster recovered #321

Closed gscabanne closed 5 years ago

gscabanne commented 5 years ago

Hi, I am trying to assemble a ddRAD data set. I have five index groups, which I demultiplexed independently and then merged into a single project. After step 2 filtering (retaining no fewer than 1,000,000 reads per sample), I have a problem in step 3. Specifically, across 22 samples I get between 30,000 and 180,000 clusters per sample, but most of those samples are not kept because they supposedly have very low depth. The report says that most of those clusters have a depth of “one”. But how can a cluster have a depth of one (only one sequence)? Note that I used a minimum cluster depth of six. A colleague has already assembled this same data set with ipyrad on another computer and did not have this problem. So far we cannot figure out the cause of this issue. What do you think?

isaacovercast commented 5 years ago

Can you post the params file you are using?

A cluster of depth one is a singleton, meaning it is uniquely different from all other reads based on your chosen clustering threshold. This is common.
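To make the singleton idea concrete, here is a minimal toy sketch in Python. It is not ipyrad's actual clustering code: the greedy seed matching, the `identity` measure on equal-length reads, and the names `greedy_cluster` and `depth_summary` are all assumptions made for illustration. It only shows how a read that matches no existing seed at the clustering threshold ends up as a depth-one cluster, and how a minimum depth of six then discards it.

```python
def identity(a, b):
    """Fraction of matching positions (toy measure, assumes pre-aligned reads)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(reads, threshold=0.85):
    """Put each read in the first cluster whose seed it matches, else seed a new one."""
    clusters = []  # list of (seed, members) pairs
    for read in reads:
        for seed, members in clusters:
            if identity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            # No seed was similar enough: this read starts its own cluster.
            clusters.append((read, [read]))
    return clusters


def depth_summary(clusters, mindepth=6):
    """Count all clusters, singletons (depth 1), and clusters passing mindepth."""
    depths = [len(members) for _, members in clusters]
    return {
        "total": len(depths),
        "singletons": sum(d == 1 for d in depths),
        "kept": sum(d >= mindepth for d in depths),
    }


# Hypothetical reads: seven identical copies, one copy with a single mismatch
# (95% identity, so it joins the first cluster), and two unrelated reads.
reads = (
    ["ACGTACGTACGTACGTACGT"] * 7
    + ["ACGTACGTACGTACGTACGA"]
    + ["TTTTTTTTTTTTTTTTTTTT", "GGGGGGGGGGGGGGGGGGGG"]
)
summary = depth_summary(greedy_cluster(reads, threshold=0.85), mindepth=6)
print(summary)
```

In this toy input the two unrelated reads each form a singleton, so only one of the three clusters survives a minimum depth of six.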

What parameters did your colleague use to assemble the data?

This is more of a chat than an actual bug in ipyrad, so if you want to jump on the ipyrad Gitter channel to respond, I will close this issue.

gscabanne commented 5 years ago

This is the param file...

------- ipyrad params file (v.0.7.28)-------------------------------------------
Af                             ## [0] [assembly_name]: Assembly name. Used to name output directories for assembly steps
/mnt/c/Demult2016/2016demultII/index1 ## [1] [project_dir]: Project dir (made in curdir if not present)
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
Merged: index1II, index2II, index3II, index6II, index12II ## [3] [barcodes_path]: Location of barcodes file
                               ## [4] [sorted_fastq_path]: Location of demultiplexed/sorted fastq files
denovo                         ## [5] [assembly_method]: Assembly method (denovo, reference, denovo+reference, denovo-reference)
                               ## [6] [reference_sequence]: Location of reference sequence file
ddrad                          ## [7] [datatype]: Datatype (see docs): rad, gbs, ddrad, etc.
TGCAG,                         ## [8] [restriction_overhang]: Restriction overhang (cut1,) or (cut1, cut2)
5                              ## [9] [max_low_qual_bases]: Max low quality base calls (Q<20) in a read
33                             ## [10] [phred_Qscore_offset]: phred Q score offset (33 is default and very standard)
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
10000                          ## [13] [maxdepth]: Max cluster depth within samples
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
0                              ## [15] [max_barcode_mismatch]: Max number of allowable mismatches in barcodes
2                              ## [16] [filter_adapters]: Filter for adapters/primers (1 or 2=stricter)
35                             ## [17] [filter_min_trim_len]: Min length of reads after adapter trim
2                              ## [18] [max_alleles_consens]: Max alleles per site in consensus sequences
5, 5                           ## [19] [max_Ns_consens]: Max N's (uncalled bases) in consensus (R1, R2)
8, 8                           ## [20] [max_Hs_consens]: Max Hs (heterozygotes) in consensus (R1, R2)
4                              ## [21] [min_samples_locus]: Min # samples per locus for output
20, 20                         ## [22] [max_SNPs_locus]: Max # SNPs per locus (R1, R2)
8, 8                           ## [23] [max_Indels_locus]: Max # of indels per locus (R1, R2)
0.5                            ## [24] [max_shared_Hs_locus]: Max # heterozygous sites per locus (R1, R2)
5, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
0, 0, 0, 0                     ## [26] [trim_loci]: Trim locus edges (see docs) (R1>, <R1, R2>, <R2)
p, s, v                        ## [27] [output_formats]: Output formats (see docs)
                               ## [28] [pop_assign_file]: Path to population assignment file
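When comparing this file against a colleague's, it can help to read the values programmatically. The following is a rough sketch, not ipyrad's own parser: `parse_params` and `PARAM_RE` are hypothetical names, and the code assumes one parameter per line in the `value ## [index] [name]: description` layout used above. The sample lines are copied from the params file in this comment.

```python
import re

# value, then "##", then "[index]" and "[name]" (layout assumed from the file above)
PARAM_RE = re.compile(r"^(.*?)##\s*\[(\d+)\]\s*\[(\w+)\]")


def parse_params(text):
    """Map parameter name -> raw value string, one parameter per line."""
    params = {}
    for line in text.splitlines():
        match = PARAM_RE.match(line)
        if match:  # the dashed header line has no "##" and is skipped
            params[match.group(3)] = match.group(1).strip()
    return params


sample = """\
------- ipyrad params file (v.0.7.28)-------------------------------------------
                               ## [2] [raw_fastq_path]: Location of raw non-demultiplexed fastq files
6                              ## [11] [mindepth_statistical]: Min depth for statistical base calling
6                              ## [12] [mindepth_majrule]: Min depth for majority-rule base calling
0.85                           ## [14] [clust_threshold]: Clustering threshold for de novo assembly
5, 0, 0, 0                     ## [25] [trim_reads]: Trim raw read edges (R1>, <R1, R2>, <R2) (see docs)
"""

params = parse_params(sample)
for name in ("mindepth_statistical", "mindepth_majrule", "clust_threshold"):
    print(name, "=", params[name])
```

Parsing two params files this way and comparing the resulting dictionaries key by key would show any settings that differ between the two assemblies.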

GUSTAVO SEBASTIÁN CABANNE Investigador-Researcher Museo Arg. de Cs. Naturales "Bernardino Rivadavia", Buenos Aires, Argentina Tel: 54 11 4982 6595 int 186 Editorial Team member of  Revista Brasileira de Ornitología, and Molecular Phylogenetics and Evolution

http://gscabanne.wixsite.com/science http://www.conicet.gov.ar/new_scp/detalle.php?keywords=&id=32339&datos_academicos=yes
