MSRUN Data loading - Githubissues

Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab

MIT License

4 stars 1 forks source link

MSRUN Data loading #45

Closed jcmatese closed 3 years ago

jcmatese commented 3 years ago

FEATURE DESCRIPTION

Feature Inspiration

Load the example data into the MsRun, PeakGroup, and PeakData model classes (and perhaps Protocol)

Feature Description

Update the msrun model, removing the unique name attribute, and adding a researcher attribute (string, required).

Parse and load example peak data from Accucor export and a minimum of command Line arguments.

CLI will take a

xlsx file (accucor format, required) with pandas.read_excel?
date (default to today) (required)
protocol either a integer (protocol.id) or name; required; if non-integer, get_or_create [e.g. capture and insert a new protocol based on command line arguments]
researcher name

msrun.name is currently modeled as per-sample, with unique name. To avoid duplicating sample/date propagation into the old name field, we will just drop the name and make a unique constraint on the attribute, and see if that is viable/useable.

msrun.researcher will be added also, for further classification.

Alternatives Considered

file argument(s) : two .tsv files (original and corrected), or

unspecified protocol insertion : find/insert a generic LC-MS protocol, of the "please specify me later" variety?

Comment

Add any other context or screenshots about the feature request here.

ISSUE OWNER SECTION

Assumptions

Requirements

Accucor xlsx output dependent
sample names are unique with the file
sample names must already be in the database (preloaded; no insertions performed)
compound labels must already be in the database (preloaded; no insertions performed)

Limitations

will not load samples, compounds
no facility for association of raw data files (.raw or .mzxml)

Affected/Changed Components

models.py will be edited (msrun)
utils.pl with be augmented with a new loader class

DESIGN

GUI Change description

command line interface

Code Change Description (Pseudocode optional)

likely all new files and classes

Tests

A test should be planned for each requirement (above), where possible.

jcmatese commented 3 years ago

@lparsons @hepcat72 @fkang-pu If you have strong opinion on the three "Alternatives to Consider" questions, please comment, otherwise I will be the "decider"

hepcat72 commented 3 years ago

I'd like to see the issue owner section fleshed out, in order to fully weigh in. However, these are my inclinations:

Regarding the input, the only concern I have is that if we take .tsv files, I would take N tsv files and determine whether each one is corrected or original (or neither) based on column numbers/content. IMHO, requiring a separate/appropriately-selected options just introduces chances for potential user error. I prefer to eliminate sources of user error. Taking excel would likely eliminate that source of error too.

Regarding missing required data such as protocol (& msrun? - which has a name), I'm actually concerned that the necessity to work around this missing data represents a design flaw, and that if we have a design flaw, it could bode for more unforeseen issues down the line. I could be wrong about the msrun info omission - so correct me if I'm wrong. But assuming I understand it correctly, if I had my druthers, I would think that we should re-evaluate this organization of the database model and answer the questions:

"What purpose does storing MSRun & protocol serve?" (I don't mean to imply they're not necessary. I'm suggesting that the answer may inform where it should occur in the model.)
"How will these datums be used in the interface?"
"Will they be used, e.g. for searching, analysis, plotting, etc?" Organizationally/data-relationships-wise, as I understand it, we simply need a way to know which peak groups are from the same MS plot. That, to me, is the important thing (more important than protocol). Do we have that data? I was under the impression that we didn't. Can it be accurately inferred given any of the data in the file? Does accucor or any other machine-generated file contain this information?

Druthers aside, if we don't capture protocol here and there's no concern regarding the msrun info, I am doubtful protocol would ever get added after the fact, so I would require a command line option for protocol. Ideally, it would be an enumeration and we could add an item in the enum as "unspecified" or "unknown". This would allow us a trigger to create a "to-do" notification associated with records to modify/add a protocol.

And incidentally, do you have any suggestions on how to improve the issue template? I noticed that it was modified. I would have thought these elements would have been in the issue owner section.

lparsons commented 3 years ago

Regarding the input, the only concern I have is that if we take .tsv files, I would take N tsv files and determine whether each one is corrected or original (or neither) based on column numbers/content. IMHO, requiring a separate/appropriately-selected options just introduces chances for potential user error. I prefer to eliminate sources of user error. Taking excel would likely eliminate that source of error too.

The headings of the files could be used to determine if they are "original" or "corrected". Additionally, having named arguments on the command line would greatly reduce the potential for error. That said, if it's easy to read the Excel file, then that would be a nice solution as well. I'm happy either way, both versions of the example data are available.

Regarding missing required data such as protocol (& msrun? - which has a name), I'm actually concerned that the necessity to work around this missing data represents a design flaw, and that if we have a design flaw, it could bode for more unforeseen issues down the line. I could be wrong about the msrun info omission - so correct me if I'm wrong. But assuming I understand it correctly, if I had my druthers, I would think that we should re-evaluate this organization of the database model and answer the questions:

"What purpose does storing MSRun & protocol serve?" (I don't mean to imply they're not necessary. I'm suggesting that the answer may inform where it should occur in the model.)

"How will these datums be used in the interface?"

"Will they be used, e.g. for searching, analysis, plotting, etc?" Organizationally/data-relationships-wise, as I understand it, we simply need a way to know which peak groups are from the same MS plot. That, to me, is the important thing (more important than protocol). Do we have that data? I was under the impression that we didn't. Can it be accurately inferred given any of the data in the file? Does accucor or any other machine-generated file contain this information?

I agree that we should re-evaluate the MSRun and Protocol tables, though I don't think that needs to be done before data is loaded, it can be done later with minimal rewriting. Focusing on the purpose and use of those tables in the interface, etc. are great ways to help us think about them.

We need to associate a PeakGroup with a Sample and a Compound. Right now, MSRun is how we make the association of a PeakGroup to a Sample. The only information I see missing is the Protocol (which seems of limited value atm) and details about the MSRun (date and a name). My suggestion would be to have a parameter that asks for the MSRun name and an optional date (could leave date null perhaps?). Either leave the Protocol out or fill in a default one.

We can then address our concerns about the model in the next phase of development, once we have something working.

Druthers aside, if we don't capture protocol here and there's no concern regarding the msrun info, I am doubtful protocol would ever get added after the fact, so I would require a command line option for protocol. Ideally, it would be an enumeration and we could add an item in the enum as "unspecified" or "unknown". This would allow us a trigger to create a "to-do" notification associated with records to modify/add a protocol.

I'm not sure about the difference between "unspecified" and null, to me either is fine. As long as we can identify those cases later.

And incidentally, do you have any suggestions on how to improve the issue template? I noticed that it was modified. I would have thought these elements would have been in the issue owner section.

I don't see any obvious modifications, but I'd suggest that be discussed on Slack or in a separate issue.

jcmatese commented 3 years ago

OK I will try the xlsx direct reading, which will restrict this as this as "Accucor dependent".

CLI will take a

xlsx file (accucor format, required)
date (default to today) (required)
protocol either a integer (protocol.id) or name; required; if non-integer, get_or_create
name/prefix for construction the msrun.name (optional?)

msrun.name is currently modeled as per-sample, with unique name. For this purpose, Name will likely be constructed as name/prefix + date + sample_name, but could perhaps be reduced to date + sample_name

lparsons commented 3 years ago

OK I will try the xlsx direct reading, which will restrict this as this as "Accucor dependent".

If you put the majority of the logic in utils.py and/or the models themselves, the script should mainly just convert the data into a dict which is passed along to the parsing methods. See load_samples.py.

hepcat72 commented 3 years ago

Just to document the discussion points from the meeting:

primary purpose: The primary purpose of specifying an MSRun, is to support lookup of mzXML/raw files in order to extract unanalyzed data.
secondary purpose: A secondary/unimportant purpose would be reproducibility - to find the mzXML/raw files so that the results can be reproduced. This is why capturing the protocol is not terribly important, and is secondary.

The way they currently look up mzXML files in the lab (in order to extract unanalyzed data) flows like this:

A sample in an analysis is noted to have an excluded compound.
The user who generated the analysis is consulted to request the raw data
The user looks in their lab notebook and notes the date, samples, and other non-codified/universal/uniform information and uses that to find the associated files. Date seems to be a primary means of identification, but the file names can sometimes infer the sample.

Some details:

1 "run" is a single "raw" file for a single "sample" but you can get multiple mzXML files from a "run"
- One example of this is that the raw file stores the run modes of negative and positive, but an mzXML file can only hold either negative or positive, not both.
- There may be also other datums in a raw file that are "combined" into a single representation in an mzXML.
Everything in 1 accucor file comes from more than 1 mzXML file

hepcat72 commented 3 years ago

I thought that protocol would be optional and msrun name would be required?

jcmatese commented 3 years ago

a msrun is currently modeled as

1 sample (required), run on
1 date (required), with
1 protocol (required)

[although we don't actually declare a unique constraint from these three, in the model]

Our only constraint is to declare that the msrun.name is unique, but where this issue is concerned (Accucor data loading), the derived data are compiled from multiple ms-runs. I think our example Accucor files are probably 44-57 MSRuns. There is no way to easily require distinct run names on the command line, beyond perhaps allowing a name prefix (e.g. -msrun_prefix=mneinast) and have the full unique name concatenated with the other attributes "mneinast BAT-xz971 on 2021-04-21", or name-convention generation like we attempt for HTSEQ assays. I was going to support that (optional prefix), but if left blank, default to "BAT-xz971 on 2021-04-21" (sample.name + msrun.date)

If there is a "default msrun naming convention" you would like to initiate, now is a good time to specify.

I was just going to allow the construction of a protocol via name only (or fetch by ID/name). [i.e. required, better than null, and I would not have to auto-insert "dummy protocol data" in the code]. Unless ya'll want me to default to "Generic LC-MS MS Run" in the code base, instead of on the command line...

hepcat72 commented 3 years ago

[First, our process is to approve change requests (i.e. issues) before work begins. Technically, what I conveyed with the rocket mechanism, is that that's once we have 2 rockets on an issue, but in practice, we've been informally approving in our developer meetings, which we haven't had since discussing this in the Tracebase meeting. So let me confirm - are you waiting to start implementation before we've worked through the details and have the 2 rockets? Because that's why I haven't clicked a rocket yet. I've been trying to figure out if we can work out the design issues so that we have an agreement. At least, I have been waiting for either 2 rockets or explicit agreement in the developer meeting before starting implementation. I was under the impression we're all trying to do that, at least.]

We did established that we would not be changing the model and that we need to be able to read the file without this extra info, so as far as rockets are concerned, I'd say that work can begin, but WRT the extra info and command line options, I think we aren't clear yet - at least, I'm not - I feel like there's a potential problem to resolve.

I guess I didn't accurately understand the relationships between an MSRun entry and an accucor(/(el-?)mavel?) file (being loaded) from the meeting. I knew that an accurcor output included multiple mzXML file inputs and that in order to look those (or the raw file) up to extract unanalyzed data, they needed a name and/or date. I was under the impression that there was only 1 date and I think I assumed 1 name - and I assumed that there would be 1 when a file was loaded. Perhaps there are 2 different contexts WRT name & date ralated the "run"?

I explicitly asked if the date should be date of entry/load and Lance said no. He said that the date of the "run"(?) would likely not be that. So it seems to me that this option should not have a default and should be required. And which date field are you entering (by default) "today" into? If date refers to MSRun (and not, e.g. to say, the run of accucor/el-maven), where does the date go and should there be multiple?

And I was under the impression that with each date entry, there would be a name.

In their manual process, he referred to 1 date/name for looking up an mzXML/raw file in his notes on "the run" (which is the critical functionality we need to support). And he said that they typically record this information in their regular activities, so that info should be supplied if we're to mimmic their processes - and would mean that both name(s?) and date(s?) would be known at the time of upload. Perhaps in the context of accucor/el-maven output, he was using "run" to refer to the run of that "analysis"? ...which would mean we should have a place other than the MSRun table to store the date(s)?

If I'm just completely off on all this, then proceed and I'll catch up in the next meeting.

jcmatese commented 3 years ago

I can drop the default date. It was just an idea for user convenience, that's all.

You can make an argument that there should be multiple dates as inputs, based on when the samples were run individually. However, AccuCor does not have this data. The batch loading of MsRuns based on a single AccuCor analysis file as input is largely a heuristic, "close-enough" process. AccuCor does not record or define the msrun name, date, the file of origin, or the protocol. Even the one thing it does provide, the sample name, must match a preloaded sample in TraceBase. So essentially every one of the tracebase.msrun attributes (except for the sample) that would be inserted would be either autogenerated (if unique, such as name), or assumed global for that batch (protocol, date).

If you want to accurately record the researcher's notebook, as opposed to just loading this test AccuCor data into the database, you would have to create a spreadsheet-type batch loader and try to link the two files (msrun batchfile and AccuCor resultfile) by shared/equivalent sample name, I guess. As far as I know, we don't have that example spreadsheet (msrun batch file).

lparsons commented 3 years ago

I think asking for a single MSRun name and date at this point would suffice. Most of the time, all samples in an AccuCor file would have been run at the same time, or at least we could make that a requirement for now. We could extend this at some point to allow users to upload AccuCor files with samples from multiple MSRuns.

Doing it this way would equate to an MSRun being a set of samples all run at the same time, serially, and not an individual sample's run through the mass spec. Since samples are run in batches, this makes sense to me.

hepcat72 commented 3 years ago

I think asking for a single MSRun name and date at this point would suffice. Most of the time, all samples in an AccuCor file would have been run at the same time, or at least we could make that a requirement for now. We could extend this at some point to allow users to upload AccuCor files with samples from multiple MSRuns.

Yes, this was my understanding from the meeting. "Most of the time", there would be a single name/date associated with the runs being analyzed and that there could be more than 1 name/date, but that it would be rare.

So I guess technically, the single entered MSRun record (including name & date) would be a stand-in for multiple actual MSRuns, and that at least for now, we would require that to be the case. So I agree with this path forward.

Doing it this way would equate to an MSRun being a set of samples all run at the same time, serially, and not an individual sample's run through the mass spec. Since samples are run in batches, this makes sense to me.

Agreed.

jcmatese commented 3 years ago

Sounds like you two wish to change the msrun model? [Instead of the solution I proposed above, that accommodates the current model and might get the data loaded]. Again, current model has msrun name as unique, and has a single sample associated with it.

lparsons commented 3 years ago

Sounds like you two wish to change the msrun model? [Instead of the solution I proposed above, that accommodates the current model and might get the data loaded]. Again, current model has msrun name as unique, and has a single sample associated with it.

Sorry, I didn't quite catch that from above. Creating multiple MSRun records, one for each sample in the AccuCor file, would also work. In fact, it might be better since it would make it simpler to associate an mzXML file with the sample since there would be only one or two files per MSRun record.

To me, either strategy could be made to work and neither would create an insurmountable problem. In fact, we could pivot without much trouble if we decide later we'd like to change things.

hepcat72 commented 3 years ago

I would tend to agree with Lance's assessment. With the caveat that I think the implication here is that of the 2 options, if we go with a single name/date placeholder, I think it would require a model change now due to the relationship you describe John. So it seems like we'd have to go with the multiple entry solution to accomplish the task at hand. Perhaps I'm wrong?

If we do multiple, then I suggest a single name/date option applied to all samples' MSRun entries, like in a loop and we require currently that they're all the same.

jcmatese commented 3 years ago

OK, after discussion at yesterday's meeting, it was decided to retire the unique name attribute on the msrun class, and add a required researcher attribute (presumably a netid or full name for the researcher). I imagine this could eventually be replaced with a foreign key to a presumptive contact table, but for now, it is an unconstrained string. It was also discussed to add a unique constrain on the collective attributes, to prevent duplicate loading; probably could not hurt in the short term. I will update the issue description/requirements.

hepcat72 commented 3 years ago

I slacked a bit with Michael yesterday, trying to refine my understandings of the data relationships. I was trying to figure out what the best way of linking samples in an accucor output to its corresponding mzXML file. We already knew that the sample names in the accucor column headers should map to the name of the mzXML file, but he pointed out what he thought would be the (eventual final) solution, which is either:

supply the location of the shared drive folder of mzXML files to be recorded in the corresponding MSRun so that researchers can match the sample column headers to the files manually
upload the folder of mzXML files and link their Tracebase archive location to the MSRun record

The Accucor:Folder relationship is a M:1 relationship because Accucor can be rerun with an overlapping combination of mzXML files. I assume that an MSRun produces 1 plot (and that that plot maps to 1 sample) and that our MSRun table has a unique record for each sample's run and that a unique MSRun record represents a set of PeakGroups that all belong to a single plot. Thus a single MSRun record should map to 1 raw file. The conversion of that raw file however maps to multiple mzXML files (e.g. "positive"). The group of samples that undergo an MS analysis (each producing 1 MS plot) on a "holistic" MSRun are collected in a single folder. I assume that those samples all come from a single mouse. It's as-yet unclear to me that a run of accucor can only include mzXML files from 1 "holistic" MSRun (e.g. on a single day) or whether it can make sense to include mzXML files from another(/multiple) folder(s).

However inaccurate my understanding above is, 1 thing seems clear to me. In order to fully support the lookup of mzXML files, the load script should likely (at least eventually) record the associated folder of mzXML files in order for the researcher to be able to map sample columns to mzXML files.

I am trying to capture all these relationships in a diagram I shared with Michael. I am currently making corrections based on his feedback. I will share what I end up with.

lparsons commented 3 years ago

We may allow them to upload an entire folder at once, but we will be making a copy of any mzXML files and storing them in a separate location, accessible only via TraceBase. This was one point we discussed in the developer meeting and I had thought all agreed.

Each MSRun will be associated with 0 or more mzXML files (likely 0, 1, or 2) which will represent the "raw" data collected by the instrument for a single sample on a given date.

jcmatese commented 3 years ago

Related : Assay file of the ISA-Tab spec, for inspiration. For example, boil that down to a sample, researcher/performer, protocol, assay, file(s) spreadsheet for msrun batch loading in the future.

fkang-pu commented 3 years ago

I use my imagination to picture how we could allow a user to load MSRun and peak data with requirements we defined. On the upload webpage for MSRun, we could have a table/form with the following fields (or use a tsv file as input) for required metadata:

example data (assume using default protocol): animal_name tissue_name sample_name msrun_date researcher mzXML_file_name accucor_file_name 969 BAT bat-xz969 2020-01-01 Xianfeng Zhang dummy1a.mzXML obob_accucor_1.tsv 969 BAT bat-xz969 2020-01-01 Xianfeng Zhang dummy1b.mzXML obob_accucor_1.tsv 969 Brain br-xz969 2020-01-01 Xianfeng Zhang dummy2a.mzXML obob_accucor_1.tsv 969 Brain br-xz969 2020-01-01 Xianfeng Zhang dummy2b.mzXML obob_accucor_1.tsv

Note: we will need additional model/table (e.g. msrun_file).

hepcat72 commented 3 years ago

We may allow them to upload an entire folder at once, but we will be making a copy of any mzXML files and storing them in a separate location, accessible only via TraceBase. This was one point we discussed in the developer meeting and I had thought all agreed.

Granted. We did. I'm just questioning the order of implementation here - and whether we should be capturing those locations instead of researcher name and date in order to map mzXML to samples in an AccuCor file upload. If the plan is to temporarily support manual/separate lookup via researcher name & date and later add mzXML folder location, then that's fine. I'm still wrapping my head around all the data relationships.

I'm just wondering if we should have the mzXML folder upload worked out before we do this load script - or if we plan to modify the scipt to capture that later. My only concern is supporting that lookup. And the mapping to the folder is simpler to map out than name & date.

Michael said that at the time of upload of an AccuCor file, they will have the mzXML files in hand and know their location. So I just wondered if we should be capturing that as a required option in the run of this script. It would make name and date redundant.

Though, the relationship between folder and accucor outputs is 1:M (possibly M:M, though I doubt it and will learn more about it). And I understand that MSRun:AccuCor analysis is 1:M, because even if we consider an MSRun to represent 1 raw output for 1 sample, there are multiple possible mzXML conversions from raw and they do multiple accucur runs using different combos of mzXML files selected from El-Maven. (Admittedly, my head's spinning a bit here.) HOWEVER: there should be only 1 folder of mzXML files for an entire day's MSRun and the mapping of an accucor's output sample data is M:1 for accucor:folder.

I really need to finish my diagram to solidify all this in my head.

lparsons commented 3 years ago

I think we should write a simple script to load AccuCor data so we can begin to work with real data. I don't see a problem with adding the ability to store mzXML files at a later date.

I agree that, that in the future, when we support mzXML file upload, we'll include a way to associate those files with the MSRun (as Fan mentioned). Each file should be stored by TraceBase with a direct link from the MSRun record. We will not have pointers to the current storage (MSData share) that the lab is using since those files can be manipulated outside of TraceBase.

The researcher may or may not have access to a folder that has some or all of the mzXML files in it. The folder may be organized by researcher and date. However, we will still want to store the researcher and the date of the MSRun regardless. We will not be storing any folders or folder names that the researcher points to, but instead storing a copy of the mzXML files that the researcher provides. We should not rely on the method the lab uses to store their mzXML files.

hepcat72 commented 3 years ago

Yeah, that's all good. I'm just having a hard time seeing the full relationship between the mzXML files, the accucor file's assortment of samples, and the MSRun table entry and I think my concern is to determine whether the relationships we establish between sample, msrun, and peakgroup can affect the future relationship with mzXML files. Like, do we plan to add an mzXML link to the MSRun table? If so, is that a 1:1 with what we currently have as a unique MSRun record? There are multiple 1:M relationships between an MSRun and (a sample in) an AccuCor file. And my perception of what an MSRun was wasn't even accurate yesterday. I think the tough part is that the relationship we want is abstract on both ends. We want to relate an accucor output file with a folder of mzXML files, but what we have is MSRun and samples. An MSRun's raw file gets converted to mzXML (a 1:M conversion) and that can go though multiple peak calling El-Maven/AccuCor analyses (1:M). I know an El-Maven/AccuCor analysis involves several mzXML files, but are all those mzXML files from the same folder?

jcmatese commented 3 years ago

Related to adding researcher to the model, this is what it looks like to make a migration for a newly added non-nullable column

python manage.py makemigrations --name msrun_researcher DataRepo
You are trying to add a non-nullable field 'researcher' to msrun without a default; we can't do that (the database needs something to populate existing rows).
Please select a fix:
 1) Provide a one-off default now (will be set on all existing rows with a null value for this column)
 2) Quit, and let me add a default in models.py
Select an option: 1
Please enter the default value now, as valid Python
The datetime and django.utils.timezone modules are available, so you can do e.g. timezone.now
Type 'exit' to exit this prompt
>>> 'default'    
Migrations for 'DataRepo':
  DataRepo/migrations/0006_msrun_researcher.py
    - Remove field name from msrun
    - Add field researcher to msrun

resulting file looks like

# Generated by Django 3.1.8 on 2021-04-27 19:29

from django.db import migrations, models

class Migration(migrations.Migration):

    dependencies = [
        ('DataRepo', '0005_auto_20210421_1054'),
    ]

    operations = [
        migrations.RemoveField(
            model_name='msrun',
            name='name',
        ),
        migrations.AddField(
            model_name='msrun',
            name='researcher',
            field=models.CharField(default='default', max_length=256),
            preserve_default=False,
        ),
    ]