AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Improve aggregated_metadata.json ordering #3459

Open davidsmejia opened 7 months ago

davidsmejia commented 7 months ago

Context

When we smash datasets together and create the aggregated_metadata.json file, we set the keys in alphabetical order.

While this has no impact on performance one way or the other, it is reasonable for a person to want to review this file by eye, and alphabetical ordering is not ideal for that: the deeply nested experiments value lands in the middle of the file, pushing most of the short top-level values far down.

Problem or idea

aggregated_metadata.json is currently serialized in alphabetical order:

  aggregate_by // download option
  created_at
  experiments
  ks_pvalue
  ks_statistic
  ks_warning
  num_experiments
  num_samples
  quant_sf_only // download option
  quantile_normalized // download option
  samples
  scale_by // download option

Solution or next step

I think we should place the values that are atomic at the top and then experiments / samples at the bottom.

I am less fond of the idea of nesting all download options together as this would be a breaking change for automated workflows.
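
As a rough illustration of the ordering proposed above, here is a minimal Python sketch. It is not the current smasher code: `NESTED_KEYS_LAST`, `order_metadata`, and `dump_metadata` are hypothetical names, and the example values are placeholders. It assumes the metadata is a plain dict at serialization time and relies on Python dicts preserving insertion order (3.7+).

```python
import json

# Hypothetical constant: top-level keys with large nested values, written last.
NESTED_KEYS_LAST = ("experiments", "samples")


def order_metadata(metadata: dict) -> dict:
    """Return a copy with atomic keys first (alphabetical), nested keys last."""
    atomic = sorted(key for key in metadata if key not in NESTED_KEYS_LAST)
    nested = [key for key in NESTED_KEYS_LAST if key in metadata]
    return {key: metadata[key] for key in atomic + nested}


def dump_metadata(metadata: dict) -> str:
    # sort_keys must stay False; sort_keys=True would re-sort alphabetically.
    return json.dumps(order_metadata(metadata), indent=2, sort_keys=False)


# Placeholder values for illustration only, not real refine.bio output.
example = {
    "samples": {},
    "scale_by": "NONE",
    "experiments": {},
    "aggregate_by": "EXPERIMENT",
    "num_samples": 0,
}
ordered_keys = list(json.loads(dump_metadata(example)))
```

All keys stay at the top level, so consumers that look values up by name are unaffected. Only the order in which the keys are written changes.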

@jaclyn-taroni for input

jaclyn-taroni commented 7 months ago

I think we should place the values that are atomic at the top and then experiments / samples at the bottom.

I agree with this approach. Good job anticipating my suggestion about nesting download options together, as well 😄