Variant ID considerations

HallLab / pandas-genomics

Pandas ExtensionDtypes for dealing with genomics data

BSD 3-Clause "New" or "Revised" License


Variant ID considerations #16

Status: Closed (jrm5100 closed this 3 years ago)

jrm5100 commented 3 years ago

The canonical way to identify variants should be consistent, using either:

1. The Series name in a Series or DataFrame
2. The variant ID associated with the GenotypeArray

Choice 1 would render the variant ID useless whenever the GenotypeArray is stored in a DataFrame. Choice 2 would require more careful validation of DataFrames to avoid duplicate IDs.

Either choice requires carefully considering how to name encoded genotype results.
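
As a rough illustration of the validation that Choice 2 implies (this is not the library's actual code: `Variant` and `duplicate_variant_ids` are hypothetical names, and a plain dict of column name to variant stands in for the DataFrame columns):

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical stand-in for the variant metadata attached to a genotype array."""
    id: str


def duplicate_variant_ids(columns: Dict[str, Variant]) -> List[str]:
    """Return the variant IDs carried by more than one column, sorted."""
    counts = Counter(variant.id for variant in columns.values())
    return sorted(vid for vid, n in counts.items() if n > 1)


# The column (Series) names are all unique, but two columns share a variant ID,
# so the two ways of identifying a variant disagree.
columns = {
    "snp_a": Variant(id="rs123"),
    "snp_b": Variant(id="rs456"),
    "snp_a_copy": Variant(id="rs123"),
}

duplicates = duplicate_variant_ids(columns)
if duplicates:
    print(f"Duplicate variant IDs: {duplicates}")
```

A check along these lines would have to run whenever genotype columns are accessed together, which is the extra validation cost mentioned above.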

jrm5100 commented 3 years ago

Some updates:

- The DataFrame accessor now requires unique variant IDs in the DataFrame.
- Initializing a variant now generates a random (uuid4) ID instead of leaving the ID as None.
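
A minimal sketch of the random-ID default, assuming a simplified, hypothetical `Variant` class with only an `id` field (the real class has more attributes and may be implemented differently):

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Variant:
    """Hypothetical, simplified variant: only the ID is modeled here."""
    # When no ID is supplied, fall back to a random uuid4 string rather than None.
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


first = Variant()             # gets its own random ID
second = Variant()            # gets a different random ID
named = Variant(id="rs123")   # an explicit ID is kept as given

print(first.id != second.id)
print(named.id)
```

With a default like this, variants created without an explicit ID can still coexist in one DataFrame under the uniqueness requirement, since they no longer all share the same None value.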