Feature request: enable creation of custom fields for individual/phenotype/projects

jxchong commented 3 years ago

As discussed offline, our group has been trying to assess how to replicate/port our existing workflows to seqr.

We've found that the data seqr allows us to enter about a given project/individual/family is currently limited. Our alternative is to operate a second database in parallel to store these pieces of data, which would require a lot of redundancy and duplicate data entry.

If possible, the ideal would be for admin users to be able to create custom fields for later use in entering individual-level and project-level data. If the ability to create custom fields isn't an option, then here are the fields we need:

Alternate Individual ID (e.g. to track investigator's IDs because often the investigator only remembers the sample by their own ID and not the sequencing ID)
Tissue Source
Sample Type
Consent Version
Broad Data Sharing
Age at Evaluation

Allow admins to create custom case/individual review statuses beyond the current set of options (if creating custom review statuses isn't possible, we could come up with a set of fixed options, but I assume custom fields would be more flexible for other groups using seqr).

hanars commented 3 years ago

Hi Jessica, can you refresh my memory about which offline conversation this was? I seem to be missing a bit of context around this request.

In general, we really push back on adding additional custom fields to seqr, as it bloats the database and UI for everyone else and makes maintenance difficult. The alternative individual ID is something we already support, but due to the complexity of changing IDs (even only cosmetically) we restrict that functionality to Project Managers. If you would like to update any of your projects to show different IDs, please feel free to reach out to the PM team and they can do that for you. We also already have an "Age of Onset" field that could work for "Age at Evaluation", and we store the "Sample type" associated with the sample that is actually loaded in seqr (shown as a hover over the green dot). We also support a free text note field, at both the family and individual level, that could easily include all of the remaining information listed here as a text block.

jxchong commented 3 years ago

We (UW) had a conference call with you a few weeks ago :)

If you don't think it's a good idea to add these fields for all seqr users, then we are hoping instead for administrators to have the ability to add custom fields (so all entries in our seqr instance would have these fields but it wouldn't affect other seqr instances).

Age of Onset is clinically different from Age at Evaluation. For example, if symptoms began at 15 years of age and the last evaluation by a doctor was at 37, the difference would matter, particularly if a phenotype is progressive/degenerative.

Sticking all these custom fields/pieces of data into the free text block isn't ideal because we won't be able to easily search/filter the database for entries matching particular criteria (e.g. consent for broad data sharing, or all individuals/projects with a sample that has a Tissue Source of brain/skin/saliva/etc).
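To illustrate the search/filter concern, here is a minimal sketch of the difference between structured fields and a free text note. It is not seqr code: `IndividualRecord`, its field names, and `filter_records` are all hypothetical and only stand in for whatever structured storage an instance might use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndividualRecord:
    # All field names here are hypothetical, not seqr's actual schema.
    individual_id: str
    tissue_source: Optional[str] = None
    broad_data_sharing: Optional[bool] = None
    notes: str = ""


def filter_records(records: List[IndividualRecord], **criteria) -> List[IndividualRecord]:
    """Return records whose structured fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


records = [
    IndividualRecord("IND-1", tissue_source="brain", broad_data_sharing=True),
    IndividualRecord("IND-2", tissue_source="saliva", broad_data_sharing=False),
    # Same facts recorded only as free text: invisible to a structured filter.
    IndividualRecord("IND-3", notes="Tissue: brain. Consented to broad sharing."),
]

matches = filter_records(records, tissue_source="brain", broad_data_sharing=True)
print([r.individual_id for r in matches])
```

In this sketch the third record carries the same information as the first, but only in its note, so an exact-match filter on the structured fields does not find it.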

hanars commented 3 years ago

Thanks for the reminder, I remember now :)

Allowing custom database fields to exist in different seqr instances is possible, but it's not a straightforward change. If UW wants to contribute engineering work (this is an open source project!) or funding for that feature work we can discuss that, but building out that kind of custom functionality that doesn't really benefit any of our direct users is honestly going to be incredibly low priority for us.

I am also a little concerned about your comment about searching the database for records. seqr is not a database, it's a platform. At the Broad, we maintain a separate Airtable database for tracking sequencing work and consents and that sort of thing. To ensure security and data integrity, you should never have anyone running SQL queries directly against the seqr database. If adding these fields is helpful to view in seqr directly (e.g. Age at Evaluation) it makes sense to show it in seqr. But if what you want to do is add a lot of data to seqr to then run SQL queries, I think you should maybe reevaluate some of your workflows. It may be worth having another meeting to talk through this use case more.

jxchong commented 3 years ago

I don't think we were planning to run SQL queries directly, but that's interesting. We currently have a separate REDCap database and have found that keeping it in sync is a pain. Do you automatically push new entries from one (seqr) to the other (e.g. Airtable)?

Anyway if you only want to add specific fields that are broadly useful to be seen directly in seqr, then I would suggest:

Alternate Individual ID (e.g. to track investigator's IDs because often the investigator only remembers the sample by their own ID and not the sequencing ID / ancillary files are often labeled with the investigator's ID and not the sequencing ID that would be stored in seqr)
Tissue Source (how are you tracking multiple tissue sources per individual if you do RNA-seq on blood and muscle plus exome/genome?)

hanars commented 3 years ago

We have full-time project managers whose job includes keeping seqr updated. But we don't duplicate data between Airtable and seqr for the most part, we just keep different data in its appropriate place. While there is some minimal amount of data that does exist in both places, like sample IDs, it's not really duplicated. seqr, for instance, doesn't have a record for samples where we are still awaiting delivery of the biological specimen; those sample IDs are often only added to seqr once we have sequencing data for them and are ready to actually start analyzing them in seqr. Data like consent version and sharing policy are never added to seqr. We never add data to seqr unless we have the proper consent to keep it there, and therefore seqr itself does not need to know anything about those policies. We don't have automation to push data between these, as there are no obvious triggers that we would actually want to have automated updates for.

for 1, as I mentioned previously there is already functionality for this, although we do restrict it to project managers. You can also use different individual IDs in seqr than VCF IDs. If you do this, you can either provide an ID mapping file to the data loading pipeline, which will map the VCF IDs to their seqr IDs before exporting the data (our recommended approach), or if you prefer you can export the data with the VCF IDs, and then when you go to add the data in seqr there is an option to provide a mapping file between the sample IDs and the seqr individual IDs

for 2, we don't have RNA data in seqr yet so this isn't an issue. When we add support for RNA-seq, we will add a field for tracking tissue source if needed. For now, if we know we have RNA-seq data for a sample we add it to the notes for the family in seqr, so the analysts know that there is data available elsewhere if they want to look at it. But for that sort of information, there's no need to have it in a structured format and the note works just fine
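The ID remapping described for item 1 above can be sketched roughly as follows. This is an illustration only, not the loading pipeline's actual code: the two-column tab-separated file layout, the function names `load_id_mapping` and `remap_sample_ids`, and the example IDs are all assumptions.

```python
import csv
import io
from typing import Dict, List


def load_id_mapping(handle) -> Dict[str, str]:
    """Read a two-column, tab-separated mapping: VCF sample ID, seqr individual ID.

    The file layout is an assumption made for this sketch.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or row[0].startswith("#"):
            continue  # skip blank, short, or comment lines
        mapping[row[0].strip()] = row[1].strip()
    return mapping


def remap_sample_ids(vcf_sample_ids: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each VCF sample ID with its mapped ID; unmapped IDs pass through unchanged."""
    return [mapping.get(sample_id, sample_id) for sample_id in vcf_sample_ids]


# Made-up IDs standing in for sequencing IDs and investigator IDs.
mapping_file = io.StringIO("SEQ_0001\tUW_FAM1_PROBAND\nSEQ_0002\tUW_FAM1_MOTHER\n")
mapping = load_id_mapping(mapping_file)
print(remap_sample_ids(["SEQ_0001", "SEQ_0002", "SEQ_0003"], mapping))
```

Passing unmapped IDs through unchanged is a design choice of this sketch; a real pipeline might prefer to fail loudly on an ID that is missing from the mapping file.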

jxchong commented 3 years ago

In the past we've wanted to do things like "identify all samples from individuals who consented to public data sharing for which we don't have a candidate gene"

for 1 - if you use this functionality, is the analyst/end user able to see both the investigator/other ID and the original VCF ID?

for 2 - tissue source is useful even aside from RNA-seq data, e.g. if you're looking for somatic mosaicism, you may have WES/WGS on brain, saliva, and blood and you'd of course potentially be searching for a variant that's het in brain but potentially homref in blood
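As a rough illustration of the mosaicism case in item 2, the sketch below assumes genotype calls are keyed by tissue source, which is only possible if tissue source is recorded as structured data. The data layout, the `mosaic_candidates` helper, the genotype labels, and the variant coordinates are all made up for the example.

```python
from typing import Dict, List

# Hypothetical layout: variant -> {tissue source -> genotype call}.
Calls = Dict[str, Dict[str, str]]


def mosaic_candidates(calls: Calls, affected: str, unaffected: str) -> List[str]:
    """Variants called het in the affected tissue and hom-ref in the unaffected one."""
    return [
        variant
        for variant, by_tissue in calls.items()
        if by_tissue.get(affected) == "het" and by_tissue.get(unaffected) == "homref"
    ]


calls = {
    # het in every tissue: looks germline, not mosaic
    "chr1:12345:A>G": {"brain": "het", "saliva": "het", "blood": "het"},
    # het in brain only: a somatic mosaicism candidate
    "chr7:67890:C>T": {"brain": "het", "saliva": "homref", "blood": "homref"},
}
print(mosaic_candidates(calls, affected="brain", unaffected="blood"))
```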

hanars commented 3 years ago

that makes sense, these just aren't questions our team really handles. I will point out that if you have your own deployed instance of seqr you can add your own custom fields to it, and if you did do that and wanted to contribute your changes back to the seqr open source project, we could figure out how to include those. I just want to be up front about the fact that doing feature work that does not benefit most seqr users is not going to be high on our priority list and I can't commit to adding this functionality within the next year at least

1 - If you remap the data at either stage of loading it to seqr, no, only the seqr ID is shown. There is the option to make the underlying individual_id different from the display_name: the display_name is what's shown everywhere in seqr, and the individual_id is what is used in search and is generally the VCF ID/remapped ID. So if you did that you could make the individual_id the VCF ID and then use a different display name field. Currently, we only show the display name to users, but making it so you can also see the individual_id is a minor change that I'm happy to make

2 - agreed that that information is helpful during analysis, but that kind of information is just as useful as a free text note as structured data. I think it's worth pointing out that notes are rich text, so you can bold and use multiple lines to display somewhat structured data to the users. I've attached a screenshot of a note to show what we do for one of our projects

broadinstitute / seqr

Feature request: enable creation of custom fields for individual/phenotype/projects #1758