Princeton-LSI-ResearchComputing / tracebase

Mouse Metabolite Tracing Data Repository for the Rabinowitz Lab
MIT License

Validation interface creates and pre-populates a study excel doc #829

Closed hepcat72 closed 2 months ago

hepcat72 commented 9 months ago

FEATURE REQUEST

Inspiration

Issue #753 pretty much describes this (which is its end-goal), but basically, pre-populating as much study data as possible based on accucor files will really speed up compilation of a submission.

Description

Based on my submission process proposal from March, compiled and annotated in my full proposal, I think we should use the version 3.0 effort as an opportunity to streamline the sheets in the excel file as laid out in my (process) proposal:

Much of the study doc will be pre-populated (see discussion) by the validation interface. All the validation interface will require is:

Optional additional inputs for validation:

See the interface section in the design below.

Goals:

Alternatives

None

Dependencies

Parent issue-tracking issue:

Comment

I created an example version of the Study Excel doc:

study.xlsx


ISSUE OWNER SECTION

Assumptions

None

Limitations

None

Affected Components

Requirements

Requirements added to what was copied from #753:

The requirements from #753:

  • [x] 3. Validation interface modes
  • [x] 3.1. Peak annotation only (with optional mzXML files)
  • [x] 3.1.1. Runs with only accucor/isocorr file(s) (currently, it requires the sample table file, I think)
  • [x] 3.1.2. Generates a stubbed-out study doc with the following tabs' pre-populated fields (see 8. for all changed columns)
  • [x] 3.1.2.1. Samples Pre-populated Columns (based on peak annotation file contents)
  • [x] 3.1.2.1.1. Sample Name (a heuristic will be used to remove _scan and _charge suffixes)
  • [x] 3.1.2.2. Treatments (optional - required if any are new) Pre-populated Columns
  • [x] 3.1.2.2.1. Animal Treatment (based on Study doc, Animals tab, Treatment column contents)
  • [x] 3.1.2.2.2. Description (based on database, empty/required if not in DB)
  • [x] 3.1.2.3. Tissues (optional - required if any are new) Pre-populated Columns
  • [x] 3.1.2.3.1. TraceBase Tissue Name (based on Study doc, Animals tab, Tissue column contents)
  • [x] 3.1.2.3.2. Description (based on database, empty/required if not in DB)
  • [x] 3.1.2.4. Infusates Pre-populated Columns
  • [x] 3.1.2.4.1. Infusate Number (based on Study doc, Tracers tab contents)
  • [x] 3.1.2.4.2. Tracer Group Name (if exists in the database)
  • [x] 3.1.2.4.3. Infusate Name (based on Study doc, Infusates tab's Tracer Group Name and Tracer Name columns)
  • [x] 3.1.2.5. Tracers Pre-populated Columns
  • [x] 3.1.2.5.1. Tracer Name
  • [x] 3.1.2.6. Compounds (optional - required if any are new) Pre-populated Columns
  • [x] 3.1.2.6.1. Compound (based on peak annotation file contents)
  • [x] 3.1.2.6.2. Formula (based on peak annotation file contents)
  • [x] 3.1.2.6.3. HMDB ID (if exists in the database)
  • [x] 3.1.2.6.4. Synonyms (if exists in the database)
  • [x] 3.1.2.7. Peak Annotation Files Pre-populated Columns
  • [x] 3.1.2.7.1. Peak Annotation File Name (based on peak annotation file names)
  • [x] 3.1.2.7.2. Peak Annotation File Type (inferred from peak annotation header contents)
  • ~[ ] 3.1.2.7.3. Sample Name Prefix (if not unique, uses study ID, if still not unique, uses animal ID, if still not unique, uses both. If not unique after that, it will keep both, but an error will prompt the user to manually change it.)~ Prefixes are no longer necessary, as the header to sample name mapping is literally in the Peak Annotation Details sheet. Previously, all samples had to be manually renamed in the samples sheet anyway and the only reason the prefix was necessary was to match the sample with a header in a peak annotation file. So with the details sheet, the prefix is no longer necessary.
  • [x] 3.1.2.8. Peak Annotation Details Pre-populated Columns
  • [x] 3.1.2.8.1. Sample Name (based on heuristically modified peak annotation file contents)
  • [x] 3.1.2.8.2. Sample Data Header (based on peak annotation file contents)
  • [x] 3.1.2.8.3. mzXML File Name (based on peak annotation file contents and omitted if mzXML files supplied and no match)
  • [x] 3.1.2.8.4. Peak Annotation File Name (based on peak annotation file name and sample header)
  • ~[ ] 3.1.2.8.5. Polarity (based on mzXML file content - empty if no matching file)~ Henceforth, polarity will only come from the mzXML file.
  • ~[ ] 3.1.2.9. Defaults (optional - required if any data is missing or generates errors/warnings, e.g. researcher name variation) Pre-populated Columns~ This is now handled by drop-downs.
  • ~[ ] 3.1.2.9.1. Researchers Confirmed (True if all are existing, empty/required if warnings/errors)~ Now only a warning, call attention to the researcher only.
  • ~[ ] 3.2. Study doc only (with optional mzXML files)~ mzXML files will be supplied by curators only before, during, or (default:) after a study load.
  • [x] 3.3. Full mode: Study doc and Peak annotation (with optional mzXML files)
  • [x] 3.4. Fields in the stub that require manual entry should be highlighted
  • [x] 3.5. Each pre-population action will be a separate method or a method that takes the tab name, column header, and row
  • [x] 5.1.1. Compounds
  • [x] 5.2.2. If no errors and not in validate mode, append rows to the consolidated data file (e.g. compounds.tsv)
  • [x] 5.2.3. If no errors and not in validate mode, remove the tab from the study doc

TODO: Be sure that the requirements above cover the different behaviors in validate mode (output a copy of the study doc that populates sheets like Compound, Treatment/Protocol, Tracer, Infusate, and Tissue with the contents of the database, to serve as an example/guide. It should also do things like add formulas to, for example, create drop-downs for the Compound column on the Tracers sheet.)
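A drop-down for the Compound column on the Tracers sheet could be wired up with xlsxwriter's list validation, sourcing values from the Compounds sheet. This is a minimal sketch, not the repository's actual code; the sheet names, column positions, fixed row range, and the `study_stub.xlsx` filename are all illustrative assumptions:

```python
import pandas as pd

# Hypothetical stand-ins for the real sheets' contents
compounds = pd.DataFrame({"Compound": ["glucose", "lactate", "alanine"]})
tracers = pd.DataFrame({"Tracer Name": ["", "", ""], "Compound": ["", "", ""]})

writer = pd.ExcelWriter("study_stub.xlsx", engine="xlsxwriter")
compounds.to_excel(writer, sheet_name="Compounds", index=False)
tracers.to_excel(writer, sheet_name="Tracers", index=False)

tracers_ws = writer.sheets["Tracers"]
# Constrain rows 2-100 of the Compound column (column B) to the values in
# the Compounds sheet's Compound column (A2:A100)
tracers_ws.data_validation(
    "B2:B100",
    {"validate": "list", "source": "='Compounds'!$A$2:$A$100"},
)
writer.close()
```

A cross-sheet `source` formula like this keeps the drop-down in sync with whatever the user adds to the Compounds sheet within the validated range.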

DESIGN

Interface Change description

The process for the end user will go like this:
  1. User submits 1 peak annotation file (and optionally, a study doc they wish to "add to") and gets back errors and a "Study" excel spreadsheet
    • The doc will be annotated with help comments in headers and (if the data is deemed ready for validation) error comments will be added to cells with bad values
    • The doc will be pre-populated based on DB contents and content parsed from peak annotation files
    • The doc will have drop-downs and formulas for drop-downs using contents of other sheets
  2. User has the option to fix any errors (e.g. add missing compounds to the compounds sheet) and/or go back to step 1 and continue to submit another pair of files to continue adding data
  3. At any point (with or without errors) proceed to submission, whether the study is done or not (submitting a file from an already loaded study will pre-populate all the study data loaded thus far)
The process for the curator will go like this:
  1. Receive a Study (just a study excel and peak annotation files)
  2. Run a script to extract new compounds/etc., update the consolidated list, and remove those sheets from the study doc, resolving issues with the new compounds/etc. and reaching out to the researcher as necessary
  3. Work on errors in the pared-down study excel file and continue running the load on the command line using a load-study script in validate mode, reaching out to the researcher(s) as needed to resolve issues, e.g. new compounds, etc
  4. Retrieve and load the mzXML files using the load_msruns.py script

Code Change Description

Create a pre-population method for each field to pre-populate. Keep track of what fields are required. Keep the pre-population methods independent of the loading scripts. Use heuristics for things like chopping _pos off sample names. Only unpopulated fields will be auto-populated.
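The suffix-chopping heuristic could look something like this. It is only a sketch: the exact suffix list (`_pos`, `_neg`, `_scan<n>`) and the function name are assumptions, not the repository's actual heuristic:

```python
import re

# Hypothetical suffix patterns; the real heuristic's list may differ
SUFFIX_PATTERN = re.compile(r"(_pos|_neg|_scan\d+)+$", re.IGNORECASE)

def guess_sample_name(header: str) -> str:
    """Strip scan/polarity suffixes from a peak annotation sample header."""
    return SUFFIX_PATTERN.sub("", header)

print(guess_sample_name("mouse1_liver_pos_scan2"))  # mouse1_liver
print(guess_sample_name("mouse2_serum"))            # mouse2_serum
```

Anchoring the pattern at the end of the header keeps it from clipping substrings like `_pos` that happen to occur mid-name.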

Things like compounds, tissues, and treatments will be populated from existing consolidated data (either adding only data related to what's in the accucor/isocorr file(s) or all data), but after submission, curators will be removing this data after updating the consolidated docs to reduce overhead, though researchers will be allowed to hang onto their fully independent study doc.

Tests

A test for each requirement

hepcat72 commented 6 months ago

Just had our developer meeting, where we talked about a proof-of-concept test interface I created on a branch named dropzone_paths. I realized that developing the functionality incrementally is appropriate given the immediate need. And I started thinking about incorporating a web interface version of #888 into this overall plan. I think that, as a component of this issue, and per the points discussed in the meeting, there are a few ways this issue can be broken up or that are otherwise worth noting:

hepcat72 commented 6 months ago

Given the considerations above, the following is the path I've devised to get to a resolution of issue #888.

NOTE: Only the first 3 steps are necessary for #888. The rest are the continuation of THIS issue, as implemented incrementally...

Breaking up issue #829:

See this comment for an updated plan.

hepcat72 commented 6 months ago

I intend to change the above plan. Important notes from slack:

Just as a preface, I think not including mzXMLs in the intermediate solution is copacetic with our overall goal of lowering the hurdles to submission. And while specifying a directory could be easier in some cases, we know that not everyone organizes their data in the same way, so it wouldn't make it easier in every case. We also know that the only time we need the user to go to the effort of mapping mzXMLs to samples in specific accucor files is when there are multiple scans of the same sample - but if the file names are identical in those cases, this interface doesn't solve that (since it can't include the directory).

In fact, I don't think there's a whole lot we can do for the user in the submission compilation process if there are multiple identically named files to be divided among multiple accucors, other than point out that there's no name match. After submission, we may be able to match things up by looking inside the files.
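"Looking inside the files" could mean scanning each mzXML's scan elements, whose `polarity` attribute the mzXML schema defines. A sketch under that assumption (the function name is made up, and namespace handling is deliberately loose):

```python
import xml.etree.ElementTree as ET

def scan_polarities(mzxml_path: str) -> set:
    """Collect the polarity attribute ("+" or "-") from each scan element,
    ignoring namespaces, as one way to 'look inside' an mzXML file."""
    polarities = set()
    for _, elem in ET.iterparse(mzxml_path):
        # The tag may be namespace-qualified, e.g. '{...}scan'
        if elem.tag.endswith("scan") and "polarity" in elem.attrib:
            polarities.add(elem.attrib["polarity"])
        elem.clear()  # free memory as we stream through the file
    return polarities
```

Comparing the set of polarities (or scan counts, retention time ranges, etc.) between identically named files could help disambiguate them after submission.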

OK, so then I propose the following:

  1. Remove the mzXML drag and drop feature (which I think we agree on)
  2. Output an excel file using the current animal/sample template, with samples filled in
  3. I think it's worthwhile to also allow an animal/sample sheet input (so that samples can be added). I still think that just rebranding the validation page, limiting it to 1 of each file (animal/sample and peak annotation), and removing errors related to this use case (no animal/sample sheet) is the way to go, but if I can't convince you of that, I'll skip this and make a separate page. I'm just not convinced that goes in the right direction.
  4. Make the peak annotation input a single field.

The other alternative I'd proposed earlier, of getting the LCMS loader done first, is moot if we're using the old template.

And later...

I did spend a few minutes BTW just now, polishing off the commit that I was 98% toward yesterday. I'll post a PR, even though some of the decisions above revert aspects of it. I'm also having second thoughts about my recommendation of limiting the interface to 1 of each (of the 2) file type(s), at least in the short term.

There's still the problem of timeouts if we try to validate the data and look for errors, but I realized that, if not given a sample sheet, we could simply not try to look for errors at all, and that would make it fast.

And while I was looking at the code just now, I realized that the reason I got on board with the 1-at-a-time strategy was the identically named mzXML issue. With that obstacle removed, I don't see any reason we can't take a series of accucor/isocorr files and output a list of all their samples in a single sample sheet.

I haven't fully fleshed this idea out, but it's enough to make me want to propose that we at least postpone implementing the limit in the next few commits/PRs. I say, that can be another incremental step, after the next meeting, and after having worked with the code.

I think that my current thinking is to change the validation page to

  1. create an excel
  2. allow multiple peak annotation inputs
  3. infer accucor/isocorr
  4. change modes to not do a validation if no sample file is supplied.
  5. remove the mzXML file input.
hepcat72 commented 6 months ago

Just some notes on creating an excel file in code...

https://stackoverflow.com/a/13437855/2057516

Example:

import pandas as pd

df = pd.DataFrame({"A": [1, 2]})  # placeholder data

# Simple one-liner:
df.to_excel('test.xlsx', sheet_name='sheet1', index=False)
# Styling, e.g.:
# - Add a color to cells: `df.style.applymap(lambda val: "color: %s" % ("red" if val < 0 else "green"))`

# For finer control, write through an ExcelWriter; note that writer.sheets is
# only populated after a dataframe has been written to the writer:
writer = pd.ExcelWriter("pandas_simple.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)
worksheet = writer.sheets["Sheet1"]
# Do stuff, e.g.:
# - Add a comment to a cell: `worksheet.write_comment("A1", "This is a comment", {"visible": True})`
# - Add a formula to a cell: `worksheet.write_formula('A1', '=SUM(1, 2, 3)')`
writer.close()

Status of features:

hepcat72 commented 6 months ago
from datetime import datetime

import pandas as pd

df = pd.DataFrame({"A": [1, 2]})  # placeholder data

writer = pd.ExcelWriter("pandas_simple.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)  # writer.sheets is empty until a df is written
workbook = writer.book
worksheet = writer.sheets["Sheet1"]

# Formats
missing_format = workbook.add_format({'bold': True, 'color': '#FF0000'})
error_format = workbook.add_format({'bold': True, 'color': 'red'})
date_format = workbook.add_format({'num_format': 'mmm d yyyy hh:mm AM/PM'})
cell_format = workbook.add_format({'bold': True})

# Examples (each write targets a distinct cell so nothing is overwritten):
worksheet.write(0, 0, 'Foo', cell_format)
worksheet.write_comment(0, 0, "comment")

worksheet.write_string(1, 0, 'Bar', cell_format)
worksheet.write_number(2, 0, 3, cell_format)
worksheet.write_blank(3, 0, '', cell_format)
worksheet.write_formula(4, 0, "=SUM(1, 2, 3)", cell_format)
worksheet.write_datetime(5, 0, datetime.now(), date_format)
worksheet.write_boolean(6, 0, True)
worksheet.write_url(7, 0, 'https://example.com')

# Set a whole row/col
worksheet.set_row(0, 18, cell_format)
worksheet.set_column('A:D', 20, cell_format)

# Extra stuff once I get something simple working:
worksheet.data_validation('A1', {'validate': 'integer',
                                 'criteria': '>',
                                 'value': 100})
worksheet.conditional_format('B3:K12', {'type':     'cell',
                                        'criteria': '>=',
                                        'value':    50,
                                        'format':   error_format})
worksheet.conditional_format('A1:A4', {'type':   'errors',
                                       'format': error_format})
worksheet.conditional_format('A2:C9',
    {'type':     'formula',
     'criteria': '=OR($B2<$C2,AND($B2="",$C2>TODAY()))',
     'format':   error_format})

writer.close()
hepcat72 commented 6 months ago

New plan:

I think that it might be a good idea to focus the validation efforts at either the point where they attempt to submit, or perhaps we can create a way to collect components of a submission, piecemeal, and at each addition, associate it with an error report. ... Still thinking this through ...

hepcat72 commented 6 months ago

Note that the read_excel method, when you set sheet_name to None, returns a dict with a dataframe for each sheet

create_or_update_study_doc: I think the thing to do would be to:

  1. Obtain a dataframe dict from each accucor
    • If an animal sample file was supplied, read dataframes from the provided excel file
    • Otherwise, create a dataframe dict with all the required columns
  2. Attempt to load the files supplied
  3. Add data to appropriate errors (e.g. add sample names parsed from accucor headers to NoSamplesError)
  4. Process each error and iteratively populate the dataframes based on the errors
  5. Process the dataframes to fill in standard placeholder values (e.g. add every sample to the same animal named "TBD")
  6. Add errors as comments to cells
  7. Format cells to highlight errors, warnings, required missing values, required values with TBD, read-only columns, etc
  8. Add formulas to cells (like the read-only cells)
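Step 1 above could be sketched like this. The names (`get_dfs_dict`, `REQUIRED_COLUMNS`) and the per-sheet column lists are illustrative assumptions, not the actual code:

```python
import pandas as pd

# Hypothetical required columns per sheet; the real study doc has more
REQUIRED_COLUMNS = {
    "Samples": ["Sample Name", "Tissue", "Animal"],
    "Animals": ["Animal", "Treatment", "Infusate Number"],
}

def get_dfs_dict(animal_sample_file=None):
    """Read all sheets from a supplied study doc, or stub out empty
    dataframes with the required columns."""
    if animal_sample_file is not None:
        # sheet_name=None returns a dict of dataframes keyed by sheet name
        return pd.read_excel(animal_sample_file, sheet_name=None)
    return {
        sheet: pd.DataFrame(columns=cols)
        for sheet, cols in REQUIRED_COLUMNS.items()
    }
```

Steps 2-8 would then mutate this dict in place before it is written back out with an ExcelWriter.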
hepcat72 commented 6 months ago

Notes on streaming a file... See my notes above on decorating the spreadsheets.

According to the openpyxl documentation, you can stream the excel content by doing the following (which I modified slightly based on other googling):

from tempfile import NamedTemporaryFile
from openpyxl import Workbook
wb = Workbook()
with NamedTemporaryFile(suffix='.xlsx') as tmp:
    wb.save(tmp.name)
    tmp.seek(0)
    stream = tmp.read()

Other notes:

# Create sheets
ws1 = wb.create_sheet("Mysheet") # insert at the end (default)
ws2 = wb.create_sheet("Mysheet", 0) # insert at first position

# Set sheet name
ws1.title = "New Title"

I was a little confused about how I get from the pandas dataframes to the stream, so I found this stack post, which says:

import pandas as pd
from django.http import HttpResponse
from io import BytesIO

excel_file = BytesIO()
xlwriter = pd.ExcelWriter(excel_file, engine='xlsxwriter')

# I modified the answer to put it in terms of my `dfs_dict`, which can have a dataframe keyed on sheets
df_output = {}
for sheet in dfs_dict.keys():
    df_output[sheet] = pd.DataFrame.from_dict(dfs_dict[sheet])
    df_output[sheet].to_excel(xlwriter, sheet_name=sheet)

xlwriter.close()  # close() saves the buffer; a separate save() call is unnecessary

# important step, rewind the buffer or when it is read() you'll get nothing
# but an error message when you try to open your zero length file in Excel
excel_file.seek(0)

# set the mime type so that the browser knows what to do with the file
response = HttpResponse(excel_file.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')

# set the file name in the Content-Disposition header
response['Content-Disposition'] = 'attachment; filename=myfile.xlsx'

return response
hepcat72 commented 6 months ago

So I realized that there's a minor problem. I learned that you can't both render_to_response and return HttpResponse(excel_file...) in one request. I was planning to both kick off the download as a stream (never hitting disk) and still be able to render an error report (if any errors exist) at the same time.

Since that's not possible, unless I figure out a way to preserve the file data (say, embed it in the template), I will have to save a temp file and then use javascript to trigger its download in the resulting page.

I wonder if embedding is feasible, the way I created the download of the lcms file...
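Embedding does look feasible: the in-memory xlsx bytes can be base64-encoded into the template context and exposed as a data-URI download link, so the page can show the error report and offer the file in one response. A sketch under that assumption (the helper name and context key are made up):

```python
import base64
from io import BytesIO

def embed_excel(excel_file: BytesIO) -> dict:
    """Base64-encode the in-memory xlsx for a template context, so the
    template can render a download link like:
    <a download="study.xlsx"
       href="data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,{{ xlsx_b64 }}">
    """
    excel_file.seek(0)  # rewind before reading, as with the HttpResponse approach
    return {"xlsx_b64": base64.b64encode(excel_file.read()).decode("ascii")}
```

Base64 inflates the payload by about a third, which should be acceptable for study-doc-sized workbooks; very large files would argue for the temp-file-plus-javascript approach instead.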

hepcat72 commented 6 months ago

Nice-to-haves (just documenting ideas - not necessarily adding to the implementation plan)

hepcat72 commented 3 months ago

TODO:

Going to augment the existing DataValidationView as a step toward an end product, instead of crafting a new class that the view just uses. There is a lot of existing code, and experimenting with what's possible will yield a better overall understanding of the elements involved, which will inform a refactor. It also lets me get the code out faster.

hepcat72 commented 3 months ago

Nice-to-have TODO:

hepcat72 commented 2 months ago

Might be able to:

hepcat72 commented 2 months ago

This issue's remaining items were made into separate issues. I checked off every item that was made into a separate issue. A consolidated list of remaining items was made in #1034 as well (incorporating items from feedback/testing).