blachlylab / mucor3

Parses VCF data into tabular spreadsheets and aggregates data by sample
MIT License

sample_indexer hardcodes some columns #8

Closed charlesgregory closed 2 years ago

charlesgregory commented 2 years ago

We should expect to always have a "sample" column in our data file(s) unless it is a pivoted table, but we have no expectation that the index/key file has specific columns. I think it currently relies on the columns "accession" and "status" being present? So we really need two flags instead of just --column, or --column needs to take a parsed value like colname1=colname2.

For example, the index/key table in our samples.xlsx could look like this:

Sequencing ID    clinical_trial name
sample1          SAM0001
sample2          SAM0002
...              ...

In which case we would need to run the script as such:

sample_indexer -f data.tsv -i samples.xlsx -c "Sequencing ID=clinical_trial name" -o converted.tsv
# or 
sample_indexer -f data.tsv -i samples.xlsx --from "Sequencing ID" --to "clinical_trial name" -o converted.tsv
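As a rough illustration of how both proposed forms could be accepted, here is a minimal argparse sketch. It is not the actual sample_indexer code: the function name parse_column_mapping and the dest names are hypothetical, and only the flags shown in the commands above are taken from this issue. Note that --from needs an explicit dest, because "from" is a Python keyword and cannot be read as args.from.

```python
import argparse


def parse_column_mapping(argv):
    """Return (from_col, to_col) from either -c "A=B" or --from A --to B.

    Hypothetical helper for illustration; not the real sample_indexer parser.
    """
    parser = argparse.ArgumentParser(prog="sample_indexer")
    # File arguments as they appear in the example commands.
    parser.add_argument("-f", dest="data_file")
    parser.add_argument("-i", dest="index_file")
    parser.add_argument("-o", dest="out_file")
    # Form 1: a single parsed value such as "colname1=colname2".
    parser.add_argument("-c", "--column", help='mapping written as "from=to"')
    # Form 2: two separate flags.
    parser.add_argument("--from", dest="from_col")
    parser.add_argument("--to", dest="to_col")
    args = parser.parse_args(argv)

    if args.column is not None:
        if args.from_col or args.to_col:
            parser.error("use either --column or --from/--to, not both")
        # Split on the first "=" only, so the target name may contain "=".
        from_col, sep, to_col = args.column.partition("=")
        if not sep or not from_col.strip() or not to_col.strip():
            parser.error('--column must look like "from=to"')
        return from_col.strip(), to_col.strip()

    if args.from_col and args.to_col:
        return args.from_col, args.to_col

    parser.error("give --column FROM=TO, or both --from and --to")


# Both invocations from the comment above resolve to the same pair.
print(parse_column_mapping(["-c", "Sequencing ID=clinical_trial name"]))
print(parse_column_mapping(["--from", "Sequencing ID", "--to", "clinical_trial name"]))
```

Either form yields the same pair of index-table column names, so the rest of the script would not need to know which one was used.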
charlesgregory commented 2 years ago

We also shouldn't assume "CLL" to be present in the sample name in the data table. If the "sample" column isn't present in the data file to be converted, then we can assume it is a pivoted table where the sample names to be converted are in the column names.
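The rule described above could be sketched roughly as follows. This is an assumption-laden illustration and not the script's real logic: convert_sample_names is a hypothetical name, the table is represented as a header list plus row lists, and leaving names that are missing from the mapping unchanged is my assumption.

```python
def convert_sample_names(header, rows, mapping, sample_col="sample"):
    """Rename samples in a long table or in a pivoted table.

    header  -- list of column names
    rows    -- list of rows, each a list of cell values
    mapping -- dict of old sample name -> new sample name
    Names absent from the mapping are left unchanged (an assumption).
    """
    if sample_col in header:
        # Long format: sample names are values in the "sample" column.
        idx = header.index(sample_col)
        new_rows = [
            row[:idx] + [mapping.get(row[idx], row[idx])] + row[idx + 1:]
            for row in rows
        ]
        return list(header), new_rows

    # No "sample" column: assume a pivoted table whose column names
    # are the sample names, and rename the header instead.
    new_header = [mapping.get(name, name) for name in header]
    return new_header, [list(row) for row in rows]


mapping = {"sample1": "SAM0001", "sample2": "SAM0002"}

# Long table: values in the "sample" column are converted.
print(convert_sample_names(["sample", "gene"], [["sample1", "TP53"]], mapping))

# Pivoted table: column names are converted, cells are untouched.
print(convert_sample_names(["gene", "sample1", "sample2"], [["TP53", 0.4, 0.1]], mapping))
```

Nothing here depends on a substring such as "CLL" appearing in the sample names; the lookup is driven only by the mapping built from the index/key table.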

Kekananen commented 2 years ago

Before I start editing the script, I need to know all the variables ahead of time, like this. Are there any more files we are using as the index? I was previously under the impression that there was only one master file of indices. Is this being used to edit other file types, besides those we talked about before, that break the established pattern (there will only be a single column or a header needing conversion, but not both)? I know you brought up this idea with pivot tables in the meeting the three of us had, but given the information at the time it seemed fine to set a single --column variable. Let me know if I misunderstood.

Kekananen commented 2 years ago

I also think both of the scenarios you posted commands for already work with the current script? We should discuss this on Monday.

Kekananen commented 2 years ago

@charlesgregory Alright, I've actually done some updates on it since I had to add in the correct method for splitting or not splitting in conversion. I've just gone ahead and added in the flags you wanted, but please let me know if there are more specs to include. Otherwise this can likely be closed.