Closed billingross closed 1 month ago
Each input file is a gzipped tsv.
Read each table and convert to a pandas dataframe.
Pandas tutorial for reading tabular data.
Pandas read_csv docs.
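A minimal sketch of reading one gzipped TSV into a dataframe with pandas.read_csv. The file name, the sample ID, and the rows are made up so the example runs on its own; the real inputs will have the full set of variant info columns.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzipped TSV so the example is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.tsv.gz")
with gzip.open(path, "wt") as handle:
    handle.write("#CHROM\tPOS\tHG00096\n1\t100\t0|1\n1\t200\t1|1\n")

# sep="\t" selects tab-delimited parsing. compression would be inferred from
# the ".gz" suffix, but it is stated explicitly here.
# dtype=str keeps genotype strings such as "0|1" exactly as written.
df = pd.read_csv(path, sep="\t", compression="gzip", dtype=str)
print(df.shape)
```

Reading everything as strings is a design choice for this sketch; numeric columns such as POS can be converted afterwards if needed.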
How should I combine these dataframes? There are three methods available in pandas: merge, join, and concatenate (concat).
I think I want either merge or join, but I'm not sure what the difference between these methods is.
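A small sketch of the practical difference, using two made-up single-sample tables: merge matches rows on named columns, while join matches on the index by default. The column names and genotype values here are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"POS": [100, 200], "HG00096": ["0|1", "1|1"]})
right = pd.DataFrame({"POS": [100, 200], "HG00097": ["0|0", "0|1"]})

# merge: match rows on the values of one or more named columns.
merged = pd.merge(left, right, on="POS")

# join: match rows on the index, so the key column is moved into the index first.
joined = left.set_index("POS").join(right.set_index("POS")).reset_index()

print(merged)
print(joined)
```

Both calls give the same table here. merge is the more direct choice when the key lives in ordinary columns, which is the case for variant info columns.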
Used this solution to join all dataframes using pandas.merge() and reduce.
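The linked solution is not reproduced here; the following is a sketch of the general pattern, assuming functools.reduce and a shared set of key columns. The frames and the shortened key list are hypothetical.

```python
from functools import reduce

import pandas as pd

# Columns shared by every input table (shortened here for illustration).
key_cols = ["#CHROM", "POS"]

frames = [
    pd.DataFrame({"#CHROM": ["1", "1"], "POS": [100, 200], "HG00096": ["0|1", "1|1"]}),
    pd.DataFrame({"#CHROM": ["1", "1"], "POS": [100, 200], "HG00097": ["0|0", "0|1"]}),
    pd.DataFrame({"#CHROM": ["1", "1"], "POS": [100, 200], "HG00099": ["1|1", "0|0"]}),
]

# reduce applies merge pairwise: ((frame0 + frame1) + frame2) + ...
combined = reduce(lambda left, right: pd.merge(left, right, on=key_cols), frames)
print(combined.columns.tolist())
```

The default inner merge keeps only variants present in every table; passing how="outer" would keep all of them, at the cost of missing values.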
Allele counts table columns:
I think I want to use the pandas apply() function to count alleles for each sample.
Example of how to use the apply() function.
I'm thinking that I could use apply to create two dataframes that separately contain minor and major allele counts.
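A rough sketch of that idea. It assumes genotypes are plain strings such as "0|1", and, purely for illustration, that allele 0 is the major allele and allele 1 the minor one; the real data may need a different rule. The helper name count_allele is hypothetical.

```python
import pandas as pd

# Sample columns only (made-up genotypes).
genotypes = pd.DataFrame({
    "HG00096": ["0|1", "1|1"],
    "HG00097": ["0|0", "0|1"],
})

def count_allele(column, allele):
    """Count occurrences of `allele` in each genotype string of a column."""
    return column.str.count(allele)

# apply calls count_allele once per column and forwards the keyword argument.
major_counts = genotypes.apply(count_allele, allele="0")  # assumption: 0 = major
minor_counts = genotypes.apply(count_allele, allele="1")  # assumption: 1 = minor

print(major_counts)
print(minor_counts)
```

Counting characters only works if the genotype field holds nothing but allele codes and a separator; fields with extra FORMAT entries would need to be split first.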
More apply examples: https://www.digitalocean.com/community/tutorials/pandas-dataframe-apply-examples.
I think I can combine this with the filter method so that the allele counting functions are applied only to sample columns.
Based on this post, I think I can use the pandas.DataFrame.filter method to split columns based on regex.
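A sketch of splitting columns with pandas.DataFrame.filter. The regex assumes every sample ID looks like "HG" followed by digits, which is a guess based on the IDs mentioned in this issue; the table itself is made up.

```python
import pandas as pd

table = pd.DataFrame({
    "#CHROM": ["1"],
    "POS": [100],
    "FORMAT": ["GT"],
    "HG00096": ["0|1"],
    "HG00101": ["1|1"],
})

# filter(regex=...) keeps the columns whose names match the pattern.
sample_cols = table.filter(regex=r"^HG\d+$")

# Everything else is treated as variant info.
info_cols = table.drop(columns=sample_cols.columns)

print(sample_cols.columns.tolist())
print(info_cols.columns.tolist())
```

The sample-only frame can then be passed to apply without touching the variant info columns.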
[x] Read files and merge all of them into a single table structure
[x] Update sample "HG10101" to be "HG00101" instead
[x] Create a new table with allele counts
[x] Create an "output" directory and write samples to separate files based on sample IDs
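A sketch of the last checklist item, writing one file per sample. The file naming scheme, the choice to keep the variant info columns in each file, and the example table are all assumptions. A temporary parent directory is used so the example does not touch the working directory.

```python
import os
import tempfile

import pandas as pd

table = pd.DataFrame({
    "#CHROM": ["1", "1"],
    "POS": [100, 200],
    "HG00096": ["0|1", "1|1"],
    "HG00101": ["0|0", "0|1"],
})
info = ["#CHROM", "POS"]  # shortened list of variant info columns

# Create the "output" directory; exist_ok avoids an error on reruns.
out_dir = os.path.join(tempfile.mkdtemp(), "output")
os.makedirs(out_dir, exist_ok=True)

sample_ids = [col for col in table.columns if col not in info]
for sample_id in sample_ids:
    out_path = os.path.join(out_dir, f"{sample_id}.tsv")
    # Each file holds the variant info columns plus one sample column.
    table[info + [sample_id]].to_csv(out_path, sep="\t", index=False)

print(sorted(os.listdir(out_dir)))
```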
The number of columns should be 9 variant info columns plus N sample columns.
Variant info columns: #CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT
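A quick sanity check for the expected column count, sketched on an empty table with two made-up sample columns.

```python
import pandas as pd

VARIANT_INFO_COLS = [
    "#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
]

# Empty table with the 9 info columns plus two hypothetical samples.
table = pd.DataFrame(columns=VARIANT_INFO_COLS + ["HG00096", "HG00101"])

# Any column that is not a variant info column is counted as a sample.
n_samples = len(table.columns.difference(VARIANT_INFO_COLS))

assert table.shape[1] == len(VARIANT_INFO_COLS) + n_samples
print(table.shape[1], n_samples)
```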
Update sample "HG10101" to be "HG00101" instead.
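A sketch of the rename, assuming the sample ID appears as a column header in the merged table (the table here is made up).

```python
import pandas as pd

table = pd.DataFrame({"POS": [100, 200], "HG10101": ["0|1", "1|1"]})

# rename returns a new dataframe; labels not in the mapping are left unchanged.
table = table.rename(columns={"HG10101": "HG00101"})

print(table.columns.tolist())
```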