HanXiaoEvo commented 4 years ago

Hi dear people,

I have a few questions regarding the input file from STACKS. My data is from ddRAD, pair-end reads, aligned to a draft genome. I noticed that in the input file made by Stacks there is no info of chrom, I am wondering if the alignment information is still there. I used ordered-export but not sure it is totally fine also because of the paired reads. I asked in the Stacks group but no one answered.

I tried hapsFromVCF to make the input but the result is rather creepy. So far now I can not use the python script because of the working environment. However it is mentioned that we can specify the maximum number of SNPs allowed at a locus, is it possible to do it using the Stacks output? Or I have to convert the data and do it through the script?

Thank you very much and look forward to your ideas!

Cheers, Han Xiao

edgardomortiz commented 4 years ago

Hello,

Can I assume you are using the latest Stacks v2.54 or ta least >2? If so, my script is no longer necessary since the populations program can produce the output needed by fineRADstructure (I wrote this for pyrad/ipyrad or the old Stacks):

This is taken from the help of the populations program (https://catchenlab.life.illinois.edu/stacks/comp/populations.php): --radpainter — output results in fineRADstructure/RADpainter format.

What is the problem you are having with the working environment? Also, is not number of SNPs at a locus but number of samples per locus (i.e., how many samples must be present in a locus to count it as valid, just a way to control the amount of missing data).

Edgardo

HanXiaoEvo commented 4 years ago

Hi Edgardo,

Thank you very much for your quick reply!

Yeah I am using Stacks v2.3 I think and the output works. However in the stacks output there is no header about chrom/position info. As my data are pair-end reads, I am just wondering if it matters for the -c estimation and how to take the paired reads into account.

For the missingness, in Stacks I call loci that present in at least 66% individuals per populations. So the loci missingness should be no more than 34%. And some individuals the missingness is just 3%. However in fineRAD I checked the missingness output and the missingness is much higher, that is why I am asking. Hope my explanation is clear, or I can provide some examples. Thank you very much again!

Cheers, Han

edgardomortiz commented 4 years ago

As for the missingness, you must be missing some other parameters in populations like -r or -R.

As for the other question about the headers I still don't get it... what is -c? In which output file are you expecting such headers? Why would they matter for fineRADstructure? The haplotypes should come from a locus where pairs are linked, but I don't know how the latest version of Stacks built loci from paired-ended data. Perhaps you can extract a locus and see if it comes from paired reads or not?

Without knowing the commands you used in which programs is harder to diagnose the issue.

Edgardo

HanXiaoEvo commented 4 years ago

Hi again, sry that the description is a bit messy :P Here I show it in details:

I do use -r, the parameters I used are here: -p 3 -r 0.66 --min-mac 3 --max-obs-het 0.6 -H --fstats --ordered-export --vcf --genepop --plink --radpainter, I have 4 populations, so I ask a locus must present in 3 of 4 populations (p3) and 66% individuals per population (r 0.66), for the missingness difference, will show you in 4.
For the header of populations.haps.radpainter, it has no prosition info, like this: ThSB_36 ThSB_37 ThSB_38 ThSB_39 ThSB_40 ThSB_41 ThSB_42 ThSB_43 ThSB_44 ThSB_45 ThSB_46 ThSB_47 ThSB_48 ThSB_49 ThSB_50 ThSB_51 ThSB_52 ThSB_53 ThSB_54 ThSB_55 ThSB_56 ThSB_57 ThSB_58 ThSB_59 ThSB_60 ThSB_61 ThSB_62 ThSB_63 ThSB_P01 ThSB_P02 ThSB_P03 CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CAGTTT/CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CAGTTT/CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CTGTGT CAGTTT/CTGTGT CTGTGT

But in the eaxample you posted it has:

Catalog ID Annotation Chr BP Consensus Sequence Num Parents Num Progeny Num SNPs SNPs Num Alleles Alleles Deleveraged MLL1 MLL2 MLL3 MLL4 MLL5 MLL6 MLL7 MLL8 MLL9 MLN1 MLN2 MLN3 MLN4 MLN5 MLN6 MLN7 MLN8 MON7 MON8 MON9 MON10 MON11 MON12 MON13 MON14 MON15 MON16 MON17 MON18 MON19 MON20 MOL1 MOL2 MOL3 MOL4 MOL5 MOL6 MOL7 MOL8 BAN1 BAN4 BAN5 BAN6 BAN7 BAN8 BAN9 BAN10 BAL1 BAL2 BAL3 BAL4 BAL5 BAL6 BAL7 BAL8 BAL9 BAL10 DS3 DS5 DS6 DO11 DO13 DS26 FL8 FL9 FL11 FL12 FL19 FL23 FL25 NB10 NB20 NB24 LU1

So when I ran this output file in fineRAD, it reports: No location ("Chr") info found - assuming the input is a simple data matrix The file seems to be in a SimpleMatrix format

So I can not make sure how the locus are arranged and if the pair-ends are linked or not.

for the missingness, it seems that the calculation for SNP missingness and haplotype missingness is different. I just calculated individual missingness using vcftools for the haplotype.vcf and found similar missingness results as finerAD now.

Thank you very much again!

Cheers, Han

edgardomortiz commented 4 years ago

Hi, Could you show me the example I posted? I am confused, I don't remember posting any examples of the matrix.

If fineRAD runs then don't worry, it seems that the message about the location is just a warning and it detects it as a SimpleMatrix which I guess is one of the formats it supports. I don't know if the lack of location affects the calculations in fineRAD, you could send an email to the authors.

HanXiaoEvo commented 4 years ago

OK, so I got the example from https://cichlid.gurdon.cam.ac.uk/fineRADstructure.html Input format: The input file (INPUT_RAD_FILE.txt) should be in one of the three following fomats:

Stacks export_sql.pl output (-a haplo -o tsv -F snps_l=1): Example

I contacted Milan but no reply....thank you all the same and I will try to contact him again!

Cheers, Han

edgardomortiz commented 4 years ago

Ah OK, I hadn't checked that page in a while. Also, how good is your draft genome?, perhaps the location doesn't matter much if the draft is very fragmented and you can use this other script they recommend in their website:

Important note: If you have a reference genome, the RAD loci should ideally be ordered according to genomic coordinates. If you have unmapped loci, we provide a script sampleLD.R that can reorder loci according to linkage disequilibrium (LD). If LD is strong and loci are not sorted, this could lead to overconfident clustering. Therefore, we recommend using the sampleLD.R script before running RADpainter to users with unmapped data who want to be extra careful to ensure they obtain a CONSERVATIVE upper bound on the number of statistically identifiable clusters. Example command line: Rscript sampleLD.R -s 1 -n 500 INPUT_RAD_FILE.txt INPUT_RAD_FILE_reordered.txt. This should do the job. If you want to understand the options, they are described in the R script.

e

HanXiaoEvo commented 4 years ago

Well the genome is in chromosme level but sure I can try this script. I just have no idea why the standard stcaks output for fineRAD has no headers for location etc.

edgardomortiz commented 4 years ago

Well, one last option (if you can skip Stacks) would be to process your samples in dDocent ( https://www.ddocent.com/ ), that pipeline uses freebayes to produce the VCF which by default is phased ( https://groups.google.com/g/freebayes/c/fyhho8_H7J0?pli=1 ), although I am not 100% sure that it makes phased calls for every SNP. Then you could use hapsFromVCF from https://github.com/millanek/fineRADstructure to get your matrix

e

HanXiaoEvo commented 4 years ago

Thank you very much still! However, that is too much work and everything starts from the beignning.

edgardomortiz / fineRADstructure-tools

Questions regarding imput file generation #4