AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Add developmental stage attribute to sample #3306

Closed davidsmejia closed 11 months ago

davidsmejia commented 1 year ago

Context

We are not currently persisting developmental_stage to the sample in the database. Per the docs, we are supposed to be harmonizing this key as refinebio_developmental_stage; it has been temporarily removed from the docs. We already parse this value from the Sample in the harmonizer and assign it to the model, but on save the value is not persisted to the database.

Problem or idea

Adding the field is pretty trivial. We just need to add something like developmental_stage = models.CharField(max_length=255, blank=True)

The harder part will be backfilling the existing samples and/or possibly just rerunning the harmonizer on all samples in some fashion. This is not computationally difficult, but to be good citizens we should determine the ideal self-imposed rate limit so that we can finish in a reasonable amount of time without thrashing the ENA API endpoints. So far I have been unable to find a way to fetch multiple biosample responses tied to a specific study in a single query. So while we can fetch an entire experiment's worth of sample metadata at once, we still need to fetch biosample metadata one sample at a time.
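
A rough sketch of what a self-imposed rate limit for the one-at-a-time backfill could look like. Everything here is hypothetical: MinIntervalLimiter, backfill_developmental_stage, the fetch_biosample callable, and the "developmental_stage" key in the returned metadata are illustrative names, not existing refinebio or ENA interfaces. The actual HTTP call is left to the caller so no endpoint is assumed.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, Optional, Tuple


class MinIntervalLimiter:
    """Blocks so that consecutive wait() calls return at least min_interval seconds apart."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            # Too soon since the previous request: sleep off the remainder.
            self._sleep(self._next_allowed - now)
            now = self._clock()
        self._next_allowed = now + self.min_interval


def backfill_developmental_stage(
    accessions: Iterable[str],
    fetch_biosample: Callable[[str], Dict[str, str]],  # hypothetical: one request per accession
    limiter: MinIntervalLimiter,
) -> Iterator[Tuple[str, str]]:
    """Yield (accession, developmental_stage) pairs, fetching one biosample at a time."""
    for accession in accessions:
        limiter.wait()
        metadata = fetch_biosample(accession)
        # Assumed key name; samples without the attribute yield an empty string.
        yield accession, metadata.get("developmental_stage", "")
```

The interval sets the total runtime directly. As a purely illustrative figure, one million samples at 0.5 seconds between requests would take about 500,000 seconds, or roughly 5.8 days, so the limit chosen decides whether a full backfill is practical.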

Solution or next step