Closed HanXiaoEvo closed 4 years ago
Hello,
Can I assume you are using the latest Stacks (v2.54), or at least a version >2? If so, my script is no longer necessary, since the populations program can produce the output needed by fineRADstructure (I wrote the script for pyrad/ipyrad or the old Stacks).
This is taken from the help of the populations program (https://catchenlab.life.illinois.edu/stacks/comp/populations.php):
--radpainter — output results in fineRADstructure/RADpainter format.
What is the problem you are having with the working environment? Also, that option is not the number of SNPs at a locus but the number of samples per locus (i.e., how many samples must be present at a locus for it to count as valid; it is just a way to control the amount of missing data).
Edgardo
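As a rough illustration of the --radpainter option quoted above, the sketch below assembles a populations command line in Python without running it. The directory and population-map names are hypothetical placeholders; -P (input directory) and -M (population map) are the usual populations inputs, and any filters would be added on top.

```python
import shlex

def populations_command(stacks_dir, popmap, extra_flags=()):
    """Assemble (but do not run) a populations call that requests RADpainter output."""
    # stacks_dir and popmap are hypothetical paths chosen for illustration.
    cmd = ["populations", "-P", stacks_dir, "-M", popmap, "--radpainter"]
    cmd.extend(extra_flags)  # optional filters such as -r or -p
    return cmd

cmd = populations_command("stacks_out", "popmap.tsv", ["-r", "0.66"])
print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run once the paths point at a real Stacks run.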
Hi Edgardo,
Thank you very much for your quick reply!
Yeah, I am using Stacks v2.3, I think, and the output works. However, in the Stacks output there is no header with chrom/position info. As my data are paired-end reads, I am wondering whether that matters for the -c estimation and how to take the paired reads into account.
For the missingness, in Stacks I keep loci that are present in at least 66% of the individuals per population, so the missingness per locus should be no more than 34%. For some individuals the missingness is just 3%. However, when I checked the missingness output in fineRAD, the missingness was much higher, which is why I am asking. I hope my explanation is clear; otherwise I can provide some examples. Thank you very much again!
Cheers, Han
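One way to check the missingness independently of fineRAD is to count it directly in the haplotype matrix. The following is a minimal sketch, assuming the file is tab-separated, that its first non-comment line lists the sample names, and that a missing haplotype appears as an empty field or "-". These tokens are assumptions, so they are exposed as a parameter and should be compared against the actual file.

```python
import io

def missingness_per_sample(handle, missing=("", "-")):
    """Return {sample: fraction of loci without a haplotype call}."""
    samples = None
    counts = []
    n_loci = 0
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if samples is None:
            samples = fields            # header: one name per sample
            counts = [0] * len(samples)
            continue
        # Pad short rows so trailing missing calls are still counted.
        fields += [""] * (len(samples) - len(fields))
        n_loci += 1
        for i, call in enumerate(fields[:len(samples)]):
            if call.strip() in missing:
                counts[i] += 1
    if samples is None:
        return {}
    return {s: (c / n_loci if n_loci else 0.0) for s, c in zip(samples, counts)}

# Tiny made-up matrix: three samples, two loci.
demo = io.StringIO("A\tB\tC\nCTG\t\tCTG/CAG\n\tCTG\tCTG\n")
result = missingness_per_sample(demo)
print(result)
```

For a real file, open it with open(path) and pass the handle instead of the StringIO demo.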
As for the missingness, you must be missing some other parameters in populations, like -r or -R.
As for the other question about the headers, I still don't get it... what is -c? In which output file are you expecting such headers? Why would they matter for fineRADstructure? The haplotypes should come from a locus where pairs are linked, but I don't know how the latest version of Stacks builds loci from paired-end data. Perhaps you can extract a locus and see if it comes from paired reads or not?
Without knowing which commands you used in which programs, it is harder to diagnose the issue.
Edgardo
Hi again, sorry that the description is a bit messy :P Here I show it in detail:
I do use -r; the parameters I used are: -p 3 -r 0.66 --min-mac 3 --max-obs-het 0.6 -H --fstats --ordered-export --vcf --genepop --plink --radpainter. I have 4 populations, so I require that a locus be present in 3 of the 4 populations (-p 3) and in 66% of the individuals per population (-r 0.66). For the missingness difference, I will show you in 4.
For the header of populations.haps.radpainter, it has no position info, like this:
ThSB_36 ThSB_37 ThSB_38 ThSB_39 ThSB_40 ThSB_41 ThSB_42 ThSB_43 ThSB_44 ThSB_45 ThSB_46 ThSB_47 ThSB_48 ThSB_49 ThSB_50 ThSB_51 ThSB_52 ThSB_53 ThSB_54 ThSB_55 ThSB_56 ThSB_57 ThSB_58 ThSB_59 ThSB_60 ThSB_61 ThSB_62 ThSB_63 ThSB_P01 ThSB_P02 ThSB_P03
CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CAGTTT/CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CAGTTT/CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CAGTTT/CTGTGT CTGTGT
But in the example you posted it has:
So when I ran this output file in fineRAD, it reported:
No location ("Chr") info found - assuming the input is a simple data matrix
The file seems to be in a SimpleMatrix format
So I cannot be sure how the loci are arranged or whether the paired ends are linked.
Thank you very much again!
Cheers, Han
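The -p 3 -r 0.66 combination described above does not by itself cap missingness at 34%. As a back-of-the-envelope sketch, assuming a locus is kept when it passes -r in at least p populations and may be entirely absent from the remaining ones (an assumption about the filter semantics), and using hypothetical population sizes of 10 samples each, the lowest overall fraction of samples carrying a retained locus can be computed like this:

```python
import math

def min_fraction_present(pop_sizes, p, r):
    """Lowest overall fraction of samples that a retained locus can have.

    Assumes the locus only needs to reach fraction r in p populations
    and can be completely missing from all other populations.
    """
    # Fewest samples per population that still satisfy r
    # (small epsilon guards against floating-point noise in r * n).
    needed = sorted(math.ceil(r * n - 1e-9) for n in pop_sizes)
    # Worst case: the locus is present only in the p cheapest populations.
    return sum(needed[:p]) / sum(pop_sizes)

# Hypothetical design: 4 populations of 10 samples each.
frac = min_fraction_present([10, 10, 10, 10], p=3, r=0.66)
print(f"minimum fraction present: {frac:.3f}")
```

Under these assumptions a locus can pass the filters with only about half of all samples genotyped, so per-locus missingness well above 34% would not be surprising. A cutoff across all samples, such as the -R option mentioned earlier in the thread, would tighten this.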
Hi, could you show me the example I posted? I am confused; I don't remember posting any examples of the matrix.
If fineRAD runs then don't worry; it seems that the message about the location is just a warning and it detects the file as a SimpleMatrix, which I guess is one of the formats it supports. I don't know if the lack of location affects the calculations in fineRAD; you could send an email to the authors.
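To see quickly which kind of header a matrix has before running fineRAD, a small check like the one below may help. It is only a sketch: it assumes a tab-separated header line and that location information, when present, shows up as a field literally named "Chr", which is inferred from the warning text rather than from the program's documentation.

```python
def has_location_column(header_line, name="Chr"):
    """Return True if the header line contains a field called `name`."""
    fields = header_line.rstrip("\r\n").split("\t")
    return name in fields

# Made-up header lines for illustration.
with_chr = has_location_column("Chr\tThSB_36\tThSB_37\n")
without_chr = has_location_column("ThSB_36\tThSB_37\tThSB_38\n")
print(with_chr, without_chr)
```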
OK, so I got the example from https://cichlid.gurdon.cam.ac.uk/fineRADstructure.html:
Input format: The input file (INPUT_RAD_FILE.txt) should be in one of the three following formats:
Stacks export_sql.pl output (-a haplo -o tsv -F snps_l=1): Example
I contacted Milan but got no reply... Thank you all the same, and I will try to contact him again!
Cheers, Han
Ah OK, I hadn't checked that page in a while. Also, how good is your draft genome? Perhaps the location doesn't matter much if the draft is very fragmented, and you can use this other script they recommend on their website:
Important note: If you have a reference genome, the RAD loci should ideally be ordered according to genomic coordinates. If you have unmapped loci, we provide a script sampleLD.R that can reorder loci according to linkage disequilibrium (LD). If LD is strong and loci are not sorted, this could lead to overconfident clustering. Therefore, we recommend using the sampleLD.R script before running RADpainter to users with unmapped data who want to be extra careful to ensure they obtain a CONSERVATIVE upper bound on the number of statistically identifiable clusters. Example command line: Rscript sampleLD.R -s 1 -n 500 INPUT_RAD_FILE.txt INPUT_RAD_FILE_reordered.txt. This should do the job. If you want to understand the options, they are described in the R script.
e
Well, the genome is at chromosome level, but sure, I can try this script. I just have no idea why the standard Stacks output for fineRAD has no headers for location etc.
Well, one last option (if you can skip Stacks) would be to process your samples in dDocent ( https://www.ddocent.com/ ). That pipeline uses freebayes to produce the VCF, which by default is phased ( https://groups.google.com/g/freebayes/c/fyhho8_H7J0?pli=1 ), although I am not 100% sure that it makes phased calls for every SNP. Then you could use hapsFromVCF from https://github.com/millanek/fineRADstructure to get your matrix.
e
Thank you very much still! However, that is too much work, and everything would have to start from the beginning.
Hi dear people,
I have a few questions regarding the input file from Stacks. My data are from ddRAD, paired-end reads, aligned to a draft genome. I noticed that the input file made by Stacks has no chrom info, and I am wondering if the alignment information is still there. I used --ordered-export, but I am not sure it is totally fine, also because of the paired reads. I asked in the Stacks group but no one answered.
I tried hapsFromVCF to make the input, but the result is rather creepy. So far I cannot use the Python script because of the working environment. However, it is mentioned that we can specify the maximum number of SNPs allowed at a locus; is it possible to do that using the Stacks output, or do I have to convert the data and do it through the script?
Thank you very much and look forward to your ideas!
Cheers, Han Xiao