billingross / data-manager-challenge


Create script to read, parse, clean input data #2

Closed billingross closed 1 month ago

billingross commented 1 month ago

Each input file is a gzipped tsv.

billingross commented 1 month ago

Read each table and convert to a pandas dataframe.

Pandas tutorial for reading tabular data.

Pandas read_csv docs.
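A minimal sketch of the read step, assuming each input is a gzip-compressed, tab-separated file with a header row. The helper name `read_gzipped_tsv`, the demo file name, and the column names are made up for illustration; the real input files are not shown in this issue.

```python
import gzip
import os
import tempfile

import pandas as pd


def read_gzipped_tsv(path):
    """Read one gzip-compressed, tab-separated file into a DataFrame."""
    # sep="\t" selects tab-delimited parsing; compression="gzip" is explicit
    # here, although pandas can also infer it from a ".gz" suffix.
    return pd.read_csv(path, sep="\t", compression="gzip")


# Write a tiny demo file so the sketch runs without the real inputs.
# The columns below are hypothetical placeholders.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_table.tsv.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("variant_id\tsample_1\nrs1\tA/G\nrs2\tC/C\n")

table = read_gzipped_tsv(demo_path)
print(table)
```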

billingross commented 1 month ago

How should I combine these dataframes? There are three methods available in pandas: merge, join, and concat.

I think I want either merge or join but I'm not sure what the difference between these methods is.
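A small sketch of the difference, as far as the defaults go: `pandas.merge` matches rows on shared columns and does an inner join by default, while `DataFrame.join` matches on the index and does a left join by default. The frames and column names here are invented for illustration.

```python
import pandas as pd

# Two hypothetical tables that share a "variant_id" key column.
left = pd.DataFrame(
    {"variant_id": ["rs1", "rs2", "rs3"], "sample_1": ["A/G", "C/C", "T/T"]}
)
right = pd.DataFrame(
    {"variant_id": ["rs2", "rs3", "rs4"], "sample_2": ["C/T", "T/T", "G/G"]}
)

# merge: matches on a column; the default how="inner" keeps only shared keys.
merged = pd.merge(left, right, on="variant_id")

# join: matches on the index; the default how="left" keeps every left row
# and fills missing right-hand values with NaN.
joined = left.set_index("variant_id").join(right.set_index("variant_id"))

print(merged)
print(joined)
```

Either method can reproduce the other by setting `how` and choosing columns or indexes as keys, so the choice is mostly about which default is more convenient.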

billingross commented 1 month ago

Used this solution to join all dataframes using pandas.merge() and reduce.
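Roughly what the merge-with-reduce pattern looks like. This is a sketch, not the linked solution itself: the key column `variant_id`, the sample columns, and the choice of `how="outer"` are assumptions.

```python
from functools import reduce

import pandas as pd

# Hypothetical per-sample tables that share a "variant_id" key column.
frames = [
    pd.DataFrame({"variant_id": ["rs1", "rs2"], "sample_1": ["A/G", "C/C"]}),
    pd.DataFrame({"variant_id": ["rs1", "rs2"], "sample_2": ["A/A", "C/T"]}),
    pd.DataFrame({"variant_id": ["rs1", "rs2"], "sample_3": ["G/G", "C/C"]}),
]

# reduce folds the list pairwise: merge(merge(frames[0], frames[1]), frames[2]).
# how="outer" keeps variants that are missing from some tables (assumption).
combined = reduce(
    lambda left, right: pd.merge(left, right, on="variant_id", how="outer"),
    frames,
)

print(combined)
```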

billingross commented 1 month ago

Allele counts table columns:

billingross commented 1 month ago

I think I want to use the pandas apply() function to count alleles for each sample.

Example of how to use the apply() function.

I'm thinking that I could use apply to create two dataframes, one containing minor allele counts and one containing major allele counts.

More apply examples: https://www.digitalocean.com/community/tutorials/pandas-dataframe-apply-examples.

I think I can combine this with the filter method so that the allele-counting functions are applied only to the sample columns.
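A rough sketch of how apply and filter could fit together. Everything about the table layout is an assumption, since the real columns are not listed here: genotypes are taken to be strings like `A/G`, sample columns are assumed to start with `sample_`, and the `major_allele` and `minor_allele` columns are hypothetical.

```python
import pandas as pd

# Hypothetical merged table; the real column names and genotype encoding may differ.
genotypes = pd.DataFrame(
    {
        "variant_id": ["rs1", "rs2"],
        "major_allele": ["A", "C"],
        "minor_allele": ["G", "T"],
        "sample_1": ["A/G", "C/C"],
        "sample_2": ["G/G", "C/T"],
    }
)

# Select only the sample columns (assumed to share the "sample_" prefix).
sample_columns = genotypes.filter(regex=r"^sample_").columns


def count_allele(row, allele_column):
    """Count copies of the row's chosen allele in each sample genotype."""
    allele = row[allele_column]
    return row[sample_columns].map(
        lambda genotype: genotype.split("/").count(allele)
    )


# axis=1 passes one row (one variant) at a time; extra keyword arguments
# given to apply are forwarded to the function.
major_counts = genotypes.apply(count_allele, axis=1, allele_column="major_allele")
minor_counts = genotypes.apply(count_allele, axis=1, allele_column="minor_allele")

print(major_counts)
print(minor_counts)
```

Row-wise apply is simple to read but can be slow on large tables; a vectorized approach may be worth considering if the inputs are big.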

billingross commented 1 month ago

Based on this post, I think I can use the pandas.DataFrame.filter method to split columns based on regex.
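A short sketch of splitting columns with `DataFrame.filter(regex=...)`, which keeps the columns whose names match the pattern (matching uses `re.search`). The `sample_` prefix and the other column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table mixing annotation columns and sample columns.
table = pd.DataFrame(
    {
        "variant_id": ["rs1"],
        "major_allele": ["A"],
        "sample_1": ["A/G"],
        "sample_2": ["G/G"],
    }
)

# Columns whose names start with "sample_".
sample_part = table.filter(regex=r"^sample_")

# Everything else: a negative lookahead rejects names starting with "sample_".
annotation_part = table.filter(regex=r"^(?!sample_)")

print(list(sample_part.columns))
print(list(annotation_part.columns))
```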