1000 genomes PCA data - Githubissues

ewels commented 7 years ago

Hi there,

@leipzig requested that I add a module to parse Peddy data for a tool I've written called MultiQC (see multiqc.info and ewels/MultiQC).

Quick question - does the PCA data shown in the background of the PCA Projection onto 1000 Genomes vary between datasets? See ewels/MultiQC#355 for the discussion - basically as these co-ordinates aren't exported by Peddy I've just taken a snapshot of that data, and I want to check that this is valid.

Many thanks,

Phil

brentp commented 7 years ago

Hi, I'm familiar with multi-qc and would love to have peddy included.

The PCA background data does not vary between datasets, but it is, for now, only for hg19/grch37 datasets. Not sure if you found it, but the labels for the populations are: "AFR AMR EAS EUR SAS UNKNOWN" and the indexes are here: https://github.com/brentp/peddy/blob/master/peddy/pca.py#L1 where, e.g. 3 would indicate that sample is EUR.

You could probably just convert that to a list of ancestries, but I uses the numbers as classes for scikit-learn. Hope that helps and let me know any more questions.

brentp commented 7 years ago

and, if there's a way I can distribute/store the data that's more amenable to external use (and still performant), I'm happy to discuss.

ewels commented 7 years ago

Brilliant, thanks for the fast response @brentp. Good to know that the data doesn't vary - I pulled the data from the plot in your example report (see resulting json files), though the UNKNOWN category didn't seem to have any data points?

I hadn't thought about the fact that it was Human only - is the genome reported anywhere in the Peddy output so that I can check this?

This example report shows what I've written so far, based on this example data from @leipzig. If there's anything else that you think I should put into the report then let me know (I don't have any experience of this type of analysis myself).

Cheers,

Phil

ewels commented 7 years ago

ps. The labels are in the line you linked to, but are you generating the PCA co-ordinates differently for every run? I just pulled the [x,y] plot co-ordinates for the MultiQC data, that was the bit I was unsure about. Apologies, my initial post wasn't very clear.

brentp commented 7 years ago

peddy doesn't know the genome, so I can't report it. Everything but ancestry should work on any diploid genome.

So, the ancestry info for the 1KG backgrounds will change each time; every time peddy is run, it subsets the 1KG sites (variants) to those that have been found in that cohort, then runs PCA on that particular subset of variants. (and then projects the incoming cohort onto those PCS). The resulting PCS are always output into the .html file, for the plot. So you could grab it from there.

Also, there is an aggregate file called $prefix.peddy.ped that contains summary info from sex_check, het_check, and ped_check that might be the simplest to pull from. The columns we use the most (other than those for relatedness) are:

het_call_rate
het_ratio
het_mean_depth
ancestry_prediction
sex_het_ratio

ewels commented 7 years ago

Ok great, thanks for the feedback. I'll remove the current bundled data then. I like to avoid parsing data from HTML files where possible (it can get messy quickly) - any chance that Peddy could output this data (type, PC1, PC2 for all points plotted) to a .csv file alongside the others?

I'm currently pulling all of the data in the MultiQC report from $prefix.peddy.ped as you say, with the exception of the sex_error column, which I take from the sex_check.csv file.

We've had some discussion over what to show in ewels/MultiQC#355 - I've already put in ancestry_prediction, sex_het_ratio and sex_error into the General Statistics table at the top of the report. I'd prefer to not add too many more columns to that top table (it can get very wide with other modules adding to it) - do you think it's worth adding a Peddy-specific table with extra columns?

Phil

brentp commented 7 years ago

I'll work on outputting a csv (or I can do JSON) of the 1KG PCs.

A key utility of peddy is the inferred relationships (relatedness, IBS0, IBS2, etc) so if you had a way to optionally plot those (or to show the static png that peddy creates), that would be great. Those values are in ped_check.csv

ewels commented 7 years ago

Thanks @brentp - sounds great! I'm in a bit of a rush to release this version of MultiQC, so I'll add the 1KG PCs in the next release if that's ok (JSON is also good).

I've just tried to replicate your IBS2 v IBS0 plot in MultiQC - doesn't look quite as nice but is hopefully ok as a starting point. I've coloured by relatedness, but not the parent-sib / sib-sib information as I'm not sure where this comes from.

I'll be away for a while after Christmas (hence the push to get a release out), but happy to continue improving the MultiQC output when I'm back. Let me know if I've got anything wrong or if there's anything I can do better in the module.

Cheers,

Phil

brentp commented 7 years ago

Sounds good let's touch base in the new year after I get out the next release that outputs the 1kg pcs. Thanks!

brentp commented 7 years ago

I made this change and pushed a new release (0.2.9) to pypi. It goes to $prefix.background_pca.json. Let me know if you need any changes.

ewels commented 7 years ago

Ah nice - that was super fast, thanks! If you have any example data kicking around it would be great if you could submit it to ewels/MultiQC_TestData so I can see what to parse. Otherwise I'll try to get some myself when I get back in mid February.

Thanks!

Phil

brentp commented 6 years ago

This is now in multiqc.

brentp / peddy

1000 genomes PCA data #20