arvkevi / ezancestry

Easy genetic ancestry predictions in Python
https://ezancestry.streamlit.app
MIT License

How can we use our own training set? #24

Open Siavash-cloud opened 2 years ago

Siavash-cloud commented 2 years ago

Hello @arvkevi, thank you for providing this software. How can we use another training set (instead of the 1000 Genomes data) in your software? Regards, Siavash

arvkevi commented 2 years ago

Hi @Siavash-cloud, thank you for checking out the repo. Great suggestion. Do you know if gnomAD has sample-level ancestry data available? Or is there another data source you were thinking about?

Siavash-cloud commented 2 years ago

Hi @arvkevi, thanks for your reply. Yes, you could also add the HGDP dataset (https://www.internationalgenome.org/data-portal/data-collection/hgdp) to your default reference panel (training set). However, what I meant was whether you could offer this as an option in your software, so that people or companies can use either their own private training set or a public dataset. As you may know, 1000 Genomes + HGDP have limited data for some populations (for example Colombian, n=7), which can affect prediction accuracy. With such an option, your software could also be used for other organisms (such as horses). Also, can I use another list of SNPs instead of the Kidd et al. 2014 and Kosoy et al. 2009 SNP lists? To the best of my knowledge, Kidd et al. 2014 selected these SNPs based on the fixation index (Fst) of SNPs in their dataset (a limited number of populations and a limited number of individuals per population), so using those SNPs can reduce predictive ability for other populations (those not included in 1000 Genomes or HGDP). Regards, Siavash
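
For readers unfamiliar with the selection criterion mentioned in this comment, here is a minimal sketch of Wright's Fst for a single biallelic SNP, computed from per-population allele frequencies as (H_T - H_S) / H_T. It only illustrates the general idea and is not the exact procedure of Kidd et al. 2014. The function name, SNP names, and frequencies are made up for the example.

```python
def fst(allele_freqs, sizes=None):
    """Wright's Fst for one biallelic SNP: (H_T - H_S) / H_T.

    allele_freqs: frequency of one allele in each population.
    sizes: optional per-population sample sizes used as weights
           (equal weights are assumed when omitted).
    """
    if sizes is None:
        sizes = [1] * len(allele_freqs)
    total = sum(sizes)
    weights = [n / total for n in sizes]

    # Expected heterozygosity of the pooled population (H_T).
    p_bar = sum(w * p for w, p in zip(weights, allele_freqs))
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # SNP is monomorphic overall, so it carries no signal

    # Weighted mean within-population heterozygosity (H_S).
    h_s = sum(w * 2 * p * (1 - p) for w, p in zip(weights, allele_freqs))
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in three populations (illustrative only).
panel = {
    "snp_a": [0.05, 0.95, 0.10],
    "snp_b": [0.50, 0.50, 0.50],
    "snp_c": [0.20, 0.80, 0.50],
}

# Rank SNPs from most to least differentiated between populations.
ranked = sorted(panel, key=lambda snp: fst(panel[snp]), reverse=True)
print(ranked)
```

Because the ranking depends entirely on the frequencies estimated in the populations at hand, a SNP list chosen this way on one panel is not guaranteed to separate populations that were absent from it, which is the concern raised above.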

arvkevi commented 2 years ago

I think it would take a bit of work to incorporate additional (or custom) reference genomes, but it's probably worth pursuing. I'll take a look at what it would take to get that functionality in.

Users can use custom SNPs with the build-model command.
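
To make the request in this thread concrete, here is a minimal pure-Python sketch of what predicting ancestry from a user-supplied reference panel and SNP list could look like in principle. It is not ezancestry's implementation or API. The function name, the 0/1/2 genotype encoding, the population labels, and the data are all assumptions made for illustration.

```python
from collections import Counter


def predict_population(sample, reference, k=3):
    """Label a sample by majority vote among its k nearest reference samples.

    sample: genotype vector (0/1/2 alternate-allele counts per SNP).
    reference: list of (genotype_vector, population_label) pairs, with
               SNPs in the same order as in `sample`.
    """
    # Squared Euclidean distance from the sample to every reference genotype.
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(sample, genotype)), label)
        for genotype, label in reference
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical custom reference panel over four user-chosen SNPs.
reference_panel = [
    ([0, 0, 1, 0], "pop_A"),
    ([0, 1, 0, 0], "pop_A"),
    ([0, 0, 0, 1], "pop_A"),
    ([2, 2, 1, 2], "pop_B"),
    ([2, 1, 2, 2], "pop_B"),
    ([1, 2, 2, 2], "pop_B"),
]

print(predict_population([0, 0, 0, 0], reference_panel))
print(predict_population([2, 2, 2, 2], reference_panel))
```

In a sketch like this, both the reference panel and the SNP columns are plain inputs, so swapping in a private dataset or a different organism only changes the data, not the code.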