Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Validation interface creates and pre-populates a study excel doc #829

Closed hepcat72 closed 2 months ago

hepcat72 commented 9 months ago

FEATURE REQUEST

Inspiration

Issue #753 pretty much describes this (which is its end-goal), but basically, pre-populating as much study data as possible based on accucor files will really speed up compilation of a submission.

Description

Based on my submission process proposal from March, compiled and annotated in my full proposal, I think we should use the version 3.0 effort as an opportunity to streamline the sheets in the excel file as laid out in my (process) proposal:

Much of the study doc will be pre-populated (see discussion) by the validation interface. All the validation interface will require is:

Optional additional inputs for validation:

See the interface section in the design below.

Goals:

Alternatives

None

Dependencies

Parent issue-tracking issue:

Comment

I created an example version of the Study Excel doc:

study.xlsx


ISSUE OWNER SECTION

Assumptions

None

Limitations

None

Affected Components

Requirements

Requirements added to what was copied from #753:

The requirements from #753:

  • [x] 3. Validation interface modes
  • [x] 3.1. Peak annotation only (with optional mzXML files)
  • [x] 3.1.1. Runs with only accucor/isocorr file(s) (currently, it requires the sample table file, I think)
  • [x] 3.1.2. Generates a stubbed-out study doc with the following tabs' pre-populated fields (see 8. for all changed columns)
  • [x] 3.1.2.1. Samples Pre-populated Columns (based on peak annotation file contents)
  • [x] 3.1.2.1.1. Sample Name (a heuristic will be used to remove _scan and _charge suffixes)
  • [x] 3.1.2.2. Treatments (optional - required if any are new) Pre-populated Columns
  • [x] 3.1.2.2.1. Animal Treatment (based on Study doc, Animals tab, Treatment column contents)
  • [x] 3.1.2.2.2. Description (based on database, empty/required if not in DB)
  • [x] 3.1.2.3. Tissues (optional - required if any are new) Pre-populated Columns
  • [x] 3.1.2.3.1. TraceBase Tissue Name (based on Study doc, Animals tab, Tissue column contents)
  • [x] 3.1.2.3.2. Description (based on database, empty/required if not in DB)
  • [x] 3.1.2.4. Infusates Pre-populated Columns
  • [x] 3.1.2.4.1. Infusate Number (based on Study doc, Tracers tab contents)
  • [x] 3.1.2.4.2. Tracer Group Name (if exists in the database)
  • [x] 3.1.2.4.3. Infusate Name (based on Study doc, Infusates tab's Tracer Group Name and Tracer Name columns)
  • [x] 3.1.2.5. Tracers Pre-populated Columns
  • [x] 3.1.2.5.1. Tracer Name
  • [x] 3.1.2.6. Compounds (optional - required if any are new) Pre-populated Columns
  • [x] 3.1.2.6.1. Compound (based on peak annotation file contents)
  • [x] 3.1.2.6.2. Formula (based on peak annotation file contents)
  • [x] 3.1.2.6.3. HMDB ID (if exists in the database)
  • [x] 3.1.2.6.4. Synonyms (if exists in the database)
  • [x] 3.1.2.7. Peak Annotation Files Pre-populated Columns
  • [x] 3.1.2.7.1. Peak Annotation File Name (based on peak annotation file names)
  • [x] 3.1.2.7.2. Peak Annotation File Type (inferred from peak annotation header contents)
  • ~[ ] 3.1.2.7.3. Sample Name Prefix (if not unique, uses study ID, if still not unique, uses animal ID, if still not unique, uses both. If not unique after that, it will keep both, but an error will prompt the user to manually change it.)~ Prefixes are no longer necessary, as the header to sample name mapping is literally in the Peak Annotation Details sheet. Previously, all samples had to be manually renamed in the samples sheet anyway and the only reason the prefix was necessary was to match the sample with a header in a peak annotation file. So with the details sheet, the prefix is no longer necessary.
  • [x] 3.1.2.8. Peak Annotation Details Pre-populated Columns
  • [x] 3.1.2.8.1. Sample Name (based on heuristically modified peak annotation file contents)
  • [x] 3.1.2.8.2. Sample Data Header (based on peak annotation file contents)
  • [x] 3.1.2.8.3. mzXML File Name (based on peak annotation file contents and omitted if mzXML files supplied and no match)
  • [x] 3.1.2.8.4. Peak Annotation File Name (based on peak annotation file name and sample header)
  • ~[ ] 3.1.2.8.5. Polarity (based on mzXML file content - empty if no matching file)~ Henceforth, polarity will only come from the mzXML file.
  • ~[ ] 3.1.2.9. Defaults (optional - required if any data is missing or generates errors/warnings, e.g. researcher name variation) Pre-populated Columns~ This is now handled by drop-downs.
  • ~[ ] 3.1.2.9.1. Researchers Confirmed (True if all are existing, empty/required if warnings/errors)~ Now only a warning, call attention to the researcher only.
  • ~[ ] 3.2. Study doc only (with optional mzXML files)~ mzXML files will be supplied by curators only before, during, or (default:) after a study load.
  • [x] 3.3. Full mode: Study doc and Peak annotation (with optional mzXML files)
  • [x] 3.4. Fields in the stub that require manual entry should be highlighted
  • [x] 3.5. Each pre-population action will be a separate method or a method that takes the tab name, column header, and row
  • [x] 5.1.1. Compounds
  • [x] 5.2.2. If no errors and not in validate mode, append rows to the consolidated data file (e.g. compounds.tsv)
  • [x] 5.2.3. If no errors and not in validate mode, remove the tab from the study doc

TODO: Be sure that the requirements above cover the different behaviors in validate mode (output a copy of the study doc that populates sheets like Compound, Treatment/Protocol, Tracer, Infusate, and Tissue with the contents of the database, to serve as an example/guide. It should also do things like add formulas to, for example, create drop-downs for the Compound column on the Tracers sheet.)
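A drop-down for the Compound column on the Tracers sheet could be wired up with xlsxwriter's list validation, sourcing values from the Compounds sheet. This is a minimal sketch, not the repository's actual code; the sheet names, column positions, fixed row range, and the `study_stub.xlsx` filename are all illustrative assumptions:

```python
import pandas as pd

# Hypothetical stand-ins for the real sheets' contents
compounds = pd.DataFrame({"Compound": ["glucose", "lactate", "alanine"]})
tracers = pd.DataFrame({"Tracer Name": ["", "", ""], "Compound": ["", "", ""]})

writer = pd.ExcelWriter("study_stub.xlsx", engine="xlsxwriter")
compounds.to_excel(writer, sheet_name="Compounds", index=False)
tracers.to_excel(writer, sheet_name="Tracers", index=False)

tracers_ws = writer.sheets["Tracers"]
# Constrain rows 2-100 of the Compound column (column B) to the values in
# the Compounds sheet's Compound column (A2:A100)
tracers_ws.data_validation(
    "B2:B100",
    {"validate": "list", "source": "='Compounds'!$A$2:$A$100"},
)
writer.close()
```

A cross-sheet `source` formula like this keeps the drop-down in sync with whatever the user adds to the Compounds sheet within the validated range.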

DESIGN

Interface Change description

The process for the end user will go like this:
  1. User submits 1 peak annotation file (and optionally, a study doc they wish to "add to") and gets back errors and a "Study" excel spreadsheet
    • The doc will be annotated with help comments in headers and (if the data is deemed ready for validation) error comments will be added to cells with bad values
    • The doc will be pre-populated based on DB contents and content parsed from peak annotation files
    • The doc will have drop-downs and formulas for drop-downs using contents of other sheets
  2. User has the option to fix any errors (e.g. add missing compounds to the compounds sheet) and/or go back to step 1 and continue to submit another pair of files to continue adding data
  3. At any point (with or without errors) proceed to submission, whether the study is done or not (submitting a file from an already loaded study will pre-populate all the study data loaded thus far)
The process for the curator will go like this:
  1. Receive a Study (just a study excel and peak annotation files)
  2. Run a script to extract new compounds/etc., update the consolidated list, and remove those sheets from the study doc, resolving issues with the new compounds/etc. and reaching out to the researcher as necessary
  3. Work on errors in the pared-down study excel file and continue running the load on the command line using a load-study script in validate mode, reaching out to the researcher(s) as needed to resolve issues, e.g. new compounds, etc
  4. Retrieve and load the mzXML files using the load_msruns.py script

Code Change Description

Create a pre-population method for each field to pre-populate. Keep track of what fields are required. Keep the pre-population methods independent of the loading scripts. Use heuristics for things like chopping _pos off sample names. Only unpopulated fields will be auto-populated.
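The suffix-chopping heuristic could look something like this. It is only a sketch: the exact suffix list (`_pos`, `_neg`, `_scan<n>`) and the function name are assumptions, not the repository's actual heuristic:

```python
import re

# Hypothetical suffix patterns; the real heuristic's list may differ
SUFFIX_PATTERN = re.compile(r"(_pos|_neg|_scan\d+)+$", re.IGNORECASE)

def guess_sample_name(header: str) -> str:
    """Strip scan/polarity suffixes from a peak annotation sample header."""
    return SUFFIX_PATTERN.sub("", header)

print(guess_sample_name("mouse1_liver_pos_scan2"))  # mouse1_liver
print(guess_sample_name("mouse2_serum"))            # mouse2_serum
```

Anchoring the pattern at the end of the header keeps it from clipping substrings like `_pos` that happen to occur mid-name.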

Things like compounds, tissues, and treatments will be populated from existing consolidated data (either adding only data related to what's in the accucor/isocorr file(s) or all data), but after submission, curators will be removing this data after updating the consolidated docs to reduce overhead, though researchers will be allowed to hang onto their fully independent study doc.

Tests

A test for each requirement

hepcat72 commented 6 months ago

Just had our developer meeting, where we talked about a proof-of-concept test interface I created on a branch named dropzone_paths. I realized that developing the functionality incrementally is appropriate given the immediate need. And I started thinking about incorporating a web interface version of #888 into this overall plan. I think that, as a component of this issue, and per the points discussed in the meeting, there are a few ways this issue can be broken up or that are otherwise worth noting:

hepcat72 commented 6 months ago

Given the considerations above, the following is the path I've devised to get to a resolution of issue #888.

NOTE: Only the first 3 steps are necessary for #888. The rest are the continuation of THIS issue, as implemented incrementally...

Breaking up issue #829:

See this comment for an updated plan.

hepcat72 commented 6 months ago

I intend to change the above plan. Important notes from slack:

Just as a preface, I think not including mzXMLs in the intermediate solution is copacetic with our overall goal of lowering the hurdles to submission. And while specifying a directory could be easier in some cases, we know that not everyone organizes their data in the same way, so it wouldn't make it easier in every case. We also know that the only time we need the user to go to the effort of mapping mzXMLs to samples in specific accucor files is when there are multiple scans of the same sample - but if the file names are identical in those cases, this interface doesn't solve that (since it can't include the directory).

In fact, I don't think there's a whole lot we can do for the user in the submission compilation process if there are multiple identically named files to be divided among multiple accucors, other than point out that there's no name match. After submission, we may be able to match things up by looking inside the files.
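"Looking inside the files" could mean scanning each mzXML's scan elements, whose `polarity` attribute the mzXML schema defines. A sketch under that assumption (the function name is made up, and namespace handling is deliberately loose):

```python
import xml.etree.ElementTree as ET

def scan_polarities(mzxml_path: str) -> set:
    """Collect the polarity attribute ("+" or "-") from each scan element,
    ignoring namespaces, as one way to 'look inside' an mzXML file."""
    polarities = set()
    for _, elem in ET.iterparse(mzxml_path):
        # The tag may be namespace-qualified, e.g. '{...}scan'
        if elem.tag.endswith("scan") and "polarity" in elem.attrib:
            polarities.add(elem.attrib["polarity"])
        elem.clear()  # free memory as we stream through the file
    return polarities
```

Comparing the set of polarities (or scan counts, retention time ranges, etc.) between identically named files could help disambiguate them after submission.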

OK, so then I propose the following:

  1. Remove the mzXML drag and drop feature (which I think we agree on)
  2. Output an excel file using the current animal/sample template, with samples filled in
  3. I think it's worthwhile to also allow an animal/sample sheet input (so that samples can be added). I still think that just rebranding the validation page, limiting it to 1 of each file (animal/sample and peak annotation), and removing errors related to this use case (no animal/sample sheet) is the way to go, but if I can't convince you of that, I'll skip this and make a separate page. I'm just not convinced that goes in the right direction.
  4. Make the peak annotation input a single field.

The other alternative I'd proposed earlier, of getting the LCMS loader done first, is moot if we're using the old template.

And later...

I did spend a few minutes BTW just now, polishing off the commit that I was 98% toward yesterday. I'll post a PR, even though some of the decisions above revert aspects of it. I'm also having second thoughts about my recommendation of limiting the interface to 1 of each (of the 2) file type(s), at least in the short term.

There's still the problem of timeouts if we try to validate the data and look for errors, but I realized that, if not given a sample sheet, we could simply not try to look for errors at all, and that would make it fast.

And while I was looking at the code just now, I realized that the reason I got on board with the 1-at-a-time strategy was the identically named mzXML issue. With that obstacle removed, I don't see any reason we can't take a series of accucor/isocorr files and output a list of all their samples in a single sample sheet.

I haven't fully fleshed this idea out, but it's enough to make me want to propose that we at least postpone implementing the limit in the next few commits/PRs. I say, that can be another incremental step, after the next meeting, and after having worked with the code.

I think that my current thinking is to change the validation page to

  1. create an excel
  2. allow multiple peak annotation inputs
  3. infer accucor/isocorr
  4. change modes to not do a validation if no sample file is supplied.
  5. remove the mzXML file input.
hepcat72 commented 6 months ago

Just some notes on creating an excel file in code...

https://stackoverflow.com/a/13437855/2057516

Example:

import pandas as pd

df = pd.DataFrame({"A": [1, 2]})  # placeholder data

# Simple one-liner:
df.to_excel('test.xlsx', sheet_name='sheet1', index=False)
# Styling, e.g.:
# - Add a color to cells: `df.style.applymap(lambda val: "color: %s" % ("red" if val < 0 else "green"))`

# For finer control, write through an ExcelWriter; note that writer.sheets is
# only populated after a dataframe has been written to the writer:
writer = pd.ExcelWriter("pandas_simple.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)
worksheet = writer.sheets["Sheet1"]
# Do stuff, e.g.:
# - Add a comment to a cell: `worksheet.write_comment("A1", "This is a comment", {"visible": True})`
# - Add a formula to a cell: `worksheet.write_formula('A1', '=SUM(1, 2, 3)')`
writer.close()

Status of features:

hepcat72 commented 6 months ago
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"A": [1, 2]})  # placeholder data

writer = pd.ExcelWriter("pandas_simple.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)  # writer.sheets is empty until a df is written
workbook = writer.book
worksheet = writer.sheets["Sheet1"]

# Formats
missing_format = workbook.add_format({'bold': True, 'color': '#FF0000'})
error_format = workbook.add_format({'bold': True, 'color': 'red'})
date_format = workbook.add_format({'num_format': 'mmm d yyyy hh:mm AM/PM'})
cell_format = workbook.add_format({'bold': True})

# Examples (each write targets a distinct cell so nothing is overwritten):
worksheet.write(0, 0, 'Foo', cell_format)
worksheet.write_comment(0, 0, "comment")

worksheet.write_string(1, 0, 'Bar', cell_format)
worksheet.write_number(2, 0, 3, cell_format)
worksheet.write_blank(3, 0, '', cell_format)
worksheet.write_formula(4, 0, "=SUM(1, 2, 3)", cell_format)
worksheet.write_datetime(5, 0, datetime.now(), date_format)
worksheet.write_boolean(6, 0, True)
worksheet.write_url(7, 0, 'https://example.com')

# Set a whole row/col
worksheet.set_row(0, 18, cell_format)
worksheet.set_column('A:D', 20, cell_format)

# Extra stuff once I get something simple working:
worksheet.data_validation('A1', {'validate': 'integer',
                                 'criteria': '>',
                                 'value': 100})
worksheet.conditional_format('B3:K12', {'type':     'cell',
                                        'criteria': '>=',
                                        'value':    50,
                                        'format':   error_format})
worksheet.conditional_format('A1:A4', {'type':   'errors',
                                       'format': error_format})
worksheet.conditional_format('A2:C9',
    {'type':     'formula',
     'criteria': '=OR($B2<$C2,AND($B2="",$C2>TODAY()))',
     'format':   error_format})

writer.close()
hepcat72 commented 6 months ago

New plan:

I think that it might be a good idea to focus the validation efforts at either the point where they attempt to submit, or perhaps we can create a way to collect components of a submission, piecemeal, and at each addition, associate it with an error report. ... Still thinking this through ...

hepcat72 commented 6 months ago

Note that the read_excel method, when you set sheet_name to None, returns a dict with a dataframe for each sheet

create_or_update_study_doc: I think the thing to do would be to:

  1. Obtain a dataframe dict from each accucor
    • If an animal sample file was supplied, read dataframes from the provided excel file
    • Otherwise, create a dataframe dict with all the required columns
  2. Attempt to load the files supplied
  3. Add data to appropriate errors (e.g. add sample names parsed from accucor headers to NoSamplesError)
  4. Process each error and iteratively populate the dataframes based on the errors
  5. Process the dataframes to fill in standard placeholder values (e.g. add every sample to the same animal named "TBD")
  6. Add errors as comments to cells
  7. Format cells to highlight errors, warnings, required missing values, required values with TBD, read-only columns, etc
  8. Add formulas to cells (like the read-only cells)
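Step 1 above could be sketched like this. The names (`get_dfs_dict`, `REQUIRED_COLUMNS`) and the per-sheet column lists are illustrative assumptions, not the actual code:

```python
import pandas as pd

# Hypothetical required columns per sheet; the real study doc has more
REQUIRED_COLUMNS = {
    "Samples": ["Sample Name", "Tissue", "Animal"],
    "Animals": ["Animal", "Treatment", "Infusate Number"],
}

def get_dfs_dict(animal_sample_file=None):
    """Read all sheets from a supplied study doc, or stub out empty
    dataframes with the required columns."""
    if animal_sample_file is not None:
        # sheet_name=None returns a dict of dataframes keyed by sheet name
        return pd.read_excel(animal_sample_file, sheet_name=None)
    return {
        sheet: pd.DataFrame(columns=cols)
        for sheet, cols in REQUIRED_COLUMNS.items()
    }
```

Steps 2-8 would then mutate this dict in place before it is written back out with an ExcelWriter.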
hepcat72 commented 6 months ago

Notes on streaming a file... See my notes above on decorating the spreadsheets.

According to the openpyxl documentation, you can stream the excel content by doing the following (which I modified slightly based on other googling):

from tempfile import NamedTemporaryFile
from openpyxl import Workbook
wb = Workbook()
with NamedTemporaryFile(suffix='.xlsx') as tmp:
    wb.save(tmp.name)
    tmp.seek(0)
    stream = tmp.read()

Other notes:

# Create sheets
ws1 = wb.create_sheet("Mysheet") # insert at the end (default)
ws2 = wb.create_sheet("Mysheet", 0) # insert at first position

# Set sheet name
ws1.title = "New Title"

I was a little confused about how I get from the pandas dataframes to the stream, so I found this stack post, which says:

import pandas as pd
from django.http import HttpResponse
from io import BytesIO

excel_file = BytesIO()
xlwriter = pd.ExcelWriter(excel_file, engine='xlsxwriter')

# I modified the answer to put it in terms of my `dfs_dict`, which can have a dataframe keyed on sheets
df_output = {}
for sheet in dfs_dict.keys():
    df_output[sheet] = pd.DataFrame.from_dict(dfs_dict[sheet])
    df_output[sheet].to_excel(xlwriter, sheet_name=sheet)

xlwriter.close()  # close() saves the buffer; a separate save() call is unnecessary

# important step, rewind the buffer or when it is read() you'll get nothing
# but an error message when you try to open your zero length file in Excel
excel_file.seek(0)

# set the mime type so that the browser knows what to do with the file
response = HttpResponse(excel_file.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')

# set the file name in the Content-Disposition header
response['Content-Disposition'] = 'attachment; filename=myfile.xlsx'

return response
hepcat72 commented 6 months ago

So I realized that there's a minor problem. I learned that you can't both render_to_response and return HttpResponse(excel_file...) in one request. I was planning to both kick off the download as a stream (never hitting disk) and still be able to render an error report (if any errors exist) at the same time.

Since that's not possible, unless I figure out a way to preserve the file data (say, embed it in the template), I will have to save a temp file and then use javascript to trigger its download in the resulting page.

I wonder if embedding is feasible, the way I created the download of the lcms file...
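Embedding does look feasible: the in-memory xlsx bytes can be base64-encoded into the template context and exposed as a data-URI download link, so the page can show the error report and offer the file in one response. A sketch under that assumption (the helper name and context key are made up):

```python
import base64
from io import BytesIO

def embed_excel(excel_file: BytesIO) -> dict:
    """Base64-encode the in-memory xlsx for a template context, so the
    template can render a download link like:
    <a download="study.xlsx"
       href="data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,{{ xlsx_b64 }}">
    """
    excel_file.seek(0)  # rewind before reading, as with the HttpResponse approach
    return {"xlsx_b64": base64.b64encode(excel_file.read()).decode("ascii")}
```

Base64 inflates the payload by about a third, which should be acceptable for study-doc-sized workbooks; very large files would argue for the temp-file-plus-javascript approach instead.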

hepcat72 commented 6 months ago

Nice-to-haves (just documenting ideas - not necessarily adding to the implementation plan)

hepcat72 commented 3 months ago

TODO:

Going to augment the existing DataValidationView as a step toward an end product, instead of crafting a new class that the view just uses. There is a lot of existing code, and experimenting with what's possible will yield a better overall understanding of the elements involved, which will inform a refactor. It also lets me get the code out faster.

hepcat72 commented 3 months ago

Nice-to-have TODO:

hepcat72 commented 2 months ago

Might be able to:

hepcat72 commented 2 months ago

This issue's remaining items were made into separate issues. I checked off every item that was made into a separate issue. A consolidated list of remaining items was made in #1034 as well (incorporating items from feedback/testing).