brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Add principal component export for bg and sample in ancestry-predict.py #28

Closed: tstannius closed this 4 years ago

tstannius commented 4 years ago

To enable creation of PCA plots in MultiQC, I have modified the script to export CSV files containing the predictions and principal components (PCs) for both the background and the query samples.
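For readers who want a picture of what such an export could look like, here is a minimal sketch. It is not the code from this PR: the function name `write_pcs_csv`, the column names, and the example values are all hypothetical, and it only assumes that each sample has an ID, a predicted ancestry label, and a list of PC coordinates.

```python
import csv
import io


def write_pcs_csv(handle, sample_ids, predictions, pcs):
    """Write one row per sample: ID, predicted ancestry, then PC1..PCn.

    Hypothetical helper; the column layout is an assumption, not the
    format produced by ancestry-predict.py.
    """
    n_pcs = len(pcs[0]) if pcs else 0
    writer = csv.writer(handle, lineterminator="\n")
    header = ["sample_id", "predicted_ancestry"]
    header += [f"PC{i + 1}" for i in range(n_pcs)]
    writer.writerow(header)
    for sample_id, ancestry, row in zip(sample_ids, predictions, pcs):
        # Fixed precision keeps the file compact and diff-friendly.
        writer.writerow([sample_id, ancestry] + [f"{v:.6f}" for v in row])


# Made-up example data, written to an in-memory buffer instead of a file.
buf = io.StringIO()
write_pcs_csv(
    buf,
    ["sampleA", "sampleB"],
    ["EUR", "EAS"],
    [[0.12, -0.5], [1.0, 0.25]],
)
csv_text = buf.getvalue()
```

Writing the background and query samples with the same column layout, one file each, would let a downstream tool such as MultiQC overlay them in a single PCA scatter plot.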

These changes affect the way ancestry-predict is called:

ORIGINAL

python scripts/ancestry-predict.py --labels scripts/ancestry-labels-1kg.tsv --samples $MY_SAMPLES/*.somalier --backgrounds 1kg-somalier/*.somalier > sample-ancestries.txt

Outputs:

NEW

python somalier/scripts/ancestry-predict.py --labels somalier/scripts/ancestry-labels-1kg.tsv --samples $MY_SAMPLES/*.somalier --backgrounds 1kg-somalier/*.somalier

Shows the plot

python3 somalier/scripts/ancestry-predict.py --labels somalier/scripts/ancestry-labels-1kg.tsv --samples data/*.somalier --backgrounds 1kg-somalier/*.somalier --plot mydir/mysample.pdf

Outputs in dir: "mydir":

Considerations

brentp commented 4 years ago

This looks good to me. If you'd add a --prefix argument that defaults to something like "somalier-ancestry", use it for the output file names, and then update the README.md as needed, I think this would be ready.
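The suggestion above could be sketched roughly as follows. The `--labels`, `--samples`, `--backgrounds`, and `--plot` flags and the "somalier-ancestry" default come from this thread; the help strings, the `output_paths` helper, and the file-name suffixes are hypothetical and only illustrate how a prefix might be turned into output names.

```python
import argparse


def build_parser():
    """Argument parser mirroring the flags mentioned in this thread."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of ancestry-predict options"
    )
    parser.add_argument("--labels", required=True, help="ancestry labels TSV")
    parser.add_argument("--samples", nargs="+", required=True)
    parser.add_argument("--backgrounds", nargs="+", required=True)
    parser.add_argument("--plot", default=None, help="optional plot path")
    parser.add_argument(
        "--prefix",
        default="somalier-ancestry",
        help="prefix for all output files",
    )
    return parser


def output_paths(prefix):
    """Derive output file names from the prefix (suffixes are assumptions)."""
    return {
        "samples": f"{prefix}.samples.csv",
        "background": f"{prefix}.background.csv",
    }


# Example invocation without --prefix, so the default applies.
args = build_parser().parse_args(
    ["--labels", "labels.tsv", "--samples", "a.somalier",
     "--backgrounds", "b.somalier"]
)
paths = output_paths(args.prefix)
```

Deriving every output name from one prefix means a user can redirect all files to another directory or run name with a single flag.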

tstannius commented 4 years ago

Done and done.

However, a co-worker suggested some improvements that I will add.

tstannius commented 4 years ago

Is there anything that needs changing, @brentp, or should I consider the edits final? If they are final, I will continue working on extending your MultiQC/Somalier PR to accommodate the ancestry prediction :-)

brentp commented 4 years ago

Thanks for the reminder. FYI, I am working on getting this functionality directly into the somalier binary. It's working but lacking a few features. It outputs the full text for background and query samples, including the PCs and the confidence for each ancestry. That will probably be a better place to start.