esrud / GONE

GONE: Scripts, programs and an example data set

Help understanding the output, NSNP_monomorphic #34

Open auNathalie opened 9 months ago

auNathalie commented 9 months ago

Dear Armando or GONe team,

I have GONe running smoothly on my data, thanks to you, Armando. However, I have a few questions about the output, especially the number of monomorphic sites (SNPs) that GONe finds per chromosome.

E.g.


CHROMOSOME 1
NIND(real sample)=20
NSNP=50107
NSNPcalculations=36608
NSNP+2alleles=0
NSNP_zeroes=0
NSNP_monomorphic=13499
NIND_corrected=20.000000
freq_MAF=0.025000
F_dev_HW (sample)=-0.026343
F_dev_HW (pop)=-0.000702
No genetic distances; using 1.000000 cM per Mb
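For readers following along, here is a minimal Python sketch that pulls the NSNP counts out of the output above and checks how they add up. It assumes, based only on the numbers in this post and not on GONE's documentation, that NSNPcalculations is what remains after the NSNP+2alleles, NSNP_zeroes and NSNP_monomorphic categories are subtracted from NSNP. The helper name `read_counts` is hypothetical.

```python
import re

# Counts copied from the CHROMOSOME 1 output quoted in the post.
line = (
    "CHROMOSOME 1 NIND(real sample)=20 NSNP=50107 NSNPcalculations=36608 "
    "NSNP+2alleles=0 NSNP_zeroes=0 NSNP_monomorphic=13499"
)


def read_counts(text):
    """Return the integer NSNP* fields of a summary line as a dict (hypothetical helper)."""
    # Field names may contain '+' or '_' (e.g. NSNP+2alleles, NSNP_zeroes).
    pairs = re.findall(r"(NSNP[\w+]*)=(\d+)", text)
    return {name: int(value) for name, value in pairs}


counts = read_counts(line)

# Assumption: the three excluded categories plus NSNPcalculations sum to NSNP.
excluded = (
    counts["NSNP+2alleles"] + counts["NSNP_zeroes"] + counts["NSNP_monomorphic"]
)
tally_ok = counts["NSNP"] - excluded == counts["NSNPcalculations"]
monomorphic_share = counts["NSNP_monomorphic"] / counts["NSNP"]

print("tally consistent:", tally_ok)
print(f"share flagged as monomorphic: {monomorphic_share:.1%}")
```

For this chromosome the tally is consistent (50107 - 13499 = 36608), and a bit more than a quarter of the SNPs are flagged as monomorphic.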


I removed all non-biallelic sites before creating the map and ped files with Plink. Despite this, GONe reports a high number of monomorphic SNPs. I counted the sites that are monomorphic for the alternative allele and found 21985 in total across the genome. I therefore have a hard time understanding which variants are being placed in the NSNP_monomorphic category, because the number found by GONe far exceeds the number of monomorphic sites (21985) in my data.

I noticed that a high proportion of monomorphic SNPs is also found in the example data provided: about 25% of the SNPs per chromosome.

• Do you know what these variants are? Are they monomorphic in the example data?

• Further, should I be concerned about the integrity of my ped and map files, since these monomorphic sites are not found in the VCF from which the ped and map files were created?

• Since these sites are not found in the VCF, do you know of any way, or tricks, to reduce the number of SNPs that GONe finds to be monomorphic?

Thank you for your time and assistance.

Best regards, Nathalie

armando-caballero commented 9 months ago

Dear Nathalie, I have double-checked with another programme that the number of monomorphic SNPs in the example file is correct. There are a total of 100,013 positions along the 10 chromosomes, of which 26,773 are monomorphic in the sample, so the number of polymorphic ones is 73,240. For the example, the map file comes from a simulation with 1,000 individuals, but the sample has only 100; that is why there are monomorphic SNPs. Could you please check this? Armando.
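To illustrate why subsampling produces monomorphic SNPs, here is a rough Python sketch of the chance that a SNP segregating in the population shows only one allele in a smaller sample. It assumes the 2n allele copies in the sample are independent draws (a binomial approximation; taking 100 of 1,000 individuals is really sampling without replacement), and the allele frequencies in the loop are hypothetical, not taken from the example data.

```python
def prob_monomorphic(p, n_individuals):
    """Probability that a biallelic SNP with population allele frequency p
    shows a single allele in a sample of n diploid individuals.

    Assumes the 2n sampled allele copies are independent draws.
    """
    copies = 2 * n_individuals
    # Either every sampled copy is the major allele or every copy is the minor one.
    return (1 - p) ** copies + p ** copies


# Hypothetical population frequencies, with a sample of 100 individuals.
for p in (0.001, 0.005, 0.01, 0.05):
    print(f"p = {p}: P(monomorphic in sample) = {prob_monomorphic(p, 100):.3f}")
```

Under these assumptions, rare variants are very likely to look monomorphic in the sample, while variants at a frequency of a few percent almost never do.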

armando-caballero commented 9 months ago

Dear Nathalie, Note also that the current version of GONE disregards SNPs with missing data in more than 50% of the sampled individuals. These would appear as "NSNP_zeroes". Armando

auNathalie commented 9 months ago

Hi Armando,

I'm sorry it has taken me a bit of time to get back to you.

I checked my data. I wrote a simple bash script to search through each pair of columns, and only found the 21985 monomorphic SNPs.
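For anyone wanting to repeat this kind of check, here is a minimal Python sketch of one way to go through each pair of allele columns in a ped file. It is not the bash script mentioned above and not GONE's internal logic. It assumes the standard Plink ped layout (six leading columns, then two allele columns per SNP, with "0" for a missing allele), and the function name `classify_ped_snps` and the toy genotypes are hypothetical. Note that it counts a SNP as monomorphic whichever allele is fixed, not only the alternative allele.

```python
from collections import Counter


def classify_ped_snps(ped_lines, missing="0"):
    """Label each SNP in PED-format lines by the alleles observed in the sample."""
    allele_counts = None
    for line in ped_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        genotypes = fields[6:]  # skip FID, IID, father, mother, sex, phenotype
        n_snps = len(genotypes) // 2
        if allele_counts is None:
            allele_counts = [Counter() for _ in range(n_snps)]
        for i in range(n_snps):
            for allele in genotypes[2 * i : 2 * i + 2]:
                if allele != missing:
                    allele_counts[i][allele] += 1

    labels = []
    for counts in allele_counts or []:
        if len(counts) == 0:
            labels.append("all_missing")
        elif len(counts) == 1:
            labels.append("monomorphic")
        elif len(counts) == 2:
            # A singleton has exactly one copy of the minor allele.
            labels.append("singleton" if min(counts.values()) == 1 else "polymorphic")
        else:
            labels.append("multiallelic")
    return labels


# Toy data: three individuals, four SNPs (hypothetical genotypes).
example = [
    "F1 I1 0 0 1 -9 A A G G C T 0 0",
    "F2 I2 0 0 2 -9 A A G A C C 0 0",
    "F3 I3 0 0 1 -9 A A G G T T 0 0",
]
labels = classify_ped_snps(example)
print(labels)
```

With a real file, `classify_ped_snps(open("data.ped"))` would work the same way (the file name is a placeholder), and `Counter(labels)` gives totals per category that can be compared with the NSNP_monomorphic values reported per chromosome.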

I may have been too hasty in running GONe, as I had not applied a minor allele frequency (MAF) filter. Once I applied a MAF threshold of 0.05, which in my setup effectively removes singletons, GONe found no monomorphic SNPs. Please let me know if you think this is not a usable solution.
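The arithmetic behind "effectively removing singletons" can be sketched in Python as follows. It assumes 20 individuals with no missing genotypes (NIND=20 in the output quoted earlier) and a filter that keeps SNPs whose MAF is at least the threshold; the function name `passes_maf` is hypothetical and the exact boundary behaviour of any particular tool may differ.

```python
from fractions import Fraction


def passes_maf(minor_count, n_individuals, threshold="0.05"):
    """True if the minor allele frequency is at least the threshold.

    Uses exact fractions to avoid floating-point surprises at the boundary.
    Assumes no missing genotypes, so there are 2 * n_individuals allele copies.
    """
    maf = Fraction(minor_count, 2 * n_individuals)
    return maf >= Fraction(threshold)


n = 20  # NIND in the output quoted earlier in the thread
for minor_count in (0, 1, 2, 3):
    maf = Fraction(minor_count, 2 * n)
    print(minor_count, float(maf), passes_maf(minor_count, n))
```

With 40 allele copies, a singleton has a frequency of 1/40 = 0.025, so a 0.05 threshold drops both truly monomorphic sites and singletons, while a SNP with two minor-allele copies sits exactly at 0.05. The 0.025 value coincides with freq_MAF=0.025000 in the output, although how GONE uses that value is not stated in this thread.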

I'm sorry to have taken up your time on this issue.

Thank you for all your help and consideration.

Very best regards, Nathalie

armando-caballero commented 8 months ago

Dear Nathalie, Applying a MAF filter of 0.05 is fine, as it has very little effect on estimates of Ne. Singletons are not monomorphic SNPs, so GONE should treat them as segregating SNPs. Best wishes, Armando.

auNathalie commented 8 months ago

Dear Armando,

Thank you for your quick reply and all your help.

Best, Nathalie