chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Update requirements for feature_length #778

Closed brianraymor closed 3 weeks ago

brianraymor commented 6 months ago

Context

Should the feature_length requirements be updated per the conversation in #single-cell-platform?

@pablo-gar's Feb 20 2024 summary:

The Comp Bios had a conversation about this (Genevieve, Katrina, Dan, and Sidney).

This is the summary:

  1. The method "merged length of all isoforms" is a commonly used method, to the best of our knowledge.
  2. The method "median" may be sensitive to groups of outliers, but in theory it can be a good alternative to 1).
  3. The motivation to change the method is to provide more stability of gene lengths over migrations -- there is a hypothesis that "merged length of all isoforms" is less stable than "median". However, we have no data to indicate that "median" will be more stable.
  4. Users who find our calculation of gene length not ideal can calculate gene lengths themselves if needed. We anticipate the number of such users is pretty low, since "merged length of all isoforms" is a relatively common practice.
  5. Gene lengths are only useful in the context of Smart-seq data, which is <5% of our data. This is a low priority item.
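
To make the two methods concrete, here is a minimal sketch on a toy gene. It assumes "merged length of all isoforms" means the length of the union of all exon intervals across a gene's isoforms, and "median" means the median of the per-isoform exonic lengths; the thread does not spell out either definition. The function names and the coordinates are hypothetical and are not taken from the curation code.

```python
from statistics import median

# Each isoform is a list of (start, end) exon intervals,
# 1-based and inclusive, as in GTF. Coordinates are made up.

def merged_length(isoforms):
    """Length of the union of all exon intervals across all isoforms."""
    intervals = sorted(exon for iso in isoforms for exon in iso)
    total = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            # Gap before this exon: close the current merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent exon: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total

def median_isoform_length(isoforms):
    """Median of per-isoform exonic lengths (assumed meaning of "median")."""
    return median(sum(end - start + 1 for start, end in iso) for iso in isoforms)

isoforms = [
    [(1, 100), (201, 300)],   # isoform length 200
    [(51, 150), (201, 300)],  # isoform length 200
    [(1, 100)],               # isoform length 100
]

print(merged_length(isoforms))          # 250
print(median_isoform_length(isoforms))  # 200
```

In this toy case the merged length grows whenever any isoform adds new exonic sequence, while the median only moves when the middle of the isoform length distribution shifts. That is the intuition behind the stability hypothesis, not evidence for it.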

We suggest not changing the method until:

  1. We test whether "median" is more stable over migrations. Test: compare the delta of gene lengths for each method, comparing schema 4.0 vs 5.0 artifacts.
  2. We test whether "median" is a good alternative method by checking that it is not too sensitive to groups of outliers -- this is our main concern with this method.
  3. We (comp bio team) can test this sometime later in Q2 (we are currently maxed out for this quarter, and this is low pri), so it would be done for the next migration(s).
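
The stability test in item 1 could be sketched roughly as follows, assuming gene lengths for each method are available as a mapping from gene ID to length for each schema version. The helper `mean_abs_delta`, the choice of mean absolute difference as the "delta", and all numbers are hypothetical placeholders; they say nothing about what the real 4.0 vs 5.0 comparison would show.

```python
from statistics import mean

def mean_abs_delta(old, new):
    """Mean absolute change in length over genes present in both versions."""
    shared = old.keys() & new.keys()
    if not shared:
        raise ValueError("no genes in common between the two versions")
    return mean(abs(new[gene] - old[gene]) for gene in shared)

# Placeholder lengths, NOT real data from schema 4.0 or 5.0 artifacts.
lengths = {
    "merged": {
        "old": {"GENE_A": 250, "GENE_B": 1200},
        "new": {"GENE_A": 300, "GENE_B": 1200},
    },
    "median": {
        "old": {"GENE_A": 200, "GENE_B": 900},
        "new": {"GENE_A": 210, "GENE_B": 900},
    },
}

deltas = {
    method: mean_abs_delta(versions["old"], versions["new"])
    for method, versions in lengths.items()
}

for method, delta in deltas.items():
    print(f"{method}: mean absolute delta = {delta}")
```

A smaller delta for one method across the same pair of versions would support the claim that it is more stable. Genes that appear in only one version are ignored here and would need separate handling.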

pablo-gar commented 4 months ago

@brianraymor and @jahilton

We are likely to get a contractor to work on this task late Q2 or early Q3. We'll share more as we know more. cc @sidneymbell

brianraymor commented 3 weeks ago

@pablo-gar added a new issue - #960. Closing as duplicate.