Closed hepcat72 closed 2 months ago
Just had our developer meeting, where we talked about a proof-of-concept test interface I created on a branch named `dropzone_paths`. I realized that developing the functionality incrementally is appropriate given the immediate need. And I started thinking about incorporating a web interface version of #888 into this overall plan. I think that, as a component of this issue, and as per the points discussed in the meeting, there are a few ways this issue can be broken up or are otherwise worth noting:

- `study.xlsx`, but this could be implemented on the current animal/sample sheet and then be updated to handle the new format
- `dropzone_paths`, either as:
- `_pos`, etc.)
- `--lcms-file` as a tsv)

Given the considerations above, the following is the path I've devised to get to a resolution of issue #888.
Breaking up issue #829:

1. Integrate the changes in the `dropzone_paths` branch into the Validation Interface to be able to take an mzXML file list
2. Create a method in the ValidationView to build an LCMS file given the mzXML file name list and an accucor file
3. FINAL OUTPUT FOR ISSUE #888: Provide the LCMS file for download in the ValidationView. The issues that follow this one are to work toward a more polished result
4. OPTIONAL [DO LATER?]: Create a method that takes a sample header list and a list of possible suffixes (`_pos`, `_neg`, `_scan1`, etc.) and returns a dict of accucor sample header to db sample name (called by the method in step 3 above that generates the LCMS file)
5. Update the ValidationView to supply the LCMS file to the loader
6. ~~Update the accucor load script to create a unique placeholder hash for mzXML files from the LCMS metadata file (and create placeholder ArchiveFile records) when in `--validate` mode~~
7. ~~Make the ValidationView catch `NoSamplesError` and `MissingSamplesError` (add the list of sample names to the exceptions, if not already there) and pass the sample names and LCMS file to a (placeholder) method (`create_or_update_study_doc`) that will create (or update) the animals/samples excel doc~~
8. ~~OPTIONAL [DO LATER?]: Make the ValidationView remove `NoSamplesError`s and `MissingSamplesError`s from the rendered view~~
9. ~~Implement the `create_or_update_study_doc` method to generate a file with sample names added~~
10. ~~OPTIONAL [DO LATER?]: Make the `create_or_update_study_doc` method able to highlight required empty fields~~
11. ~~OPTIONAL [DO LATER?]: Make the `create_or_update_study_doc` method able to add placeholders for other required fields (like Animal Name, etc.). Initially, every sample would be added to an "undefined" or "unknown" animal field that is highlighted for edit (and make the non-validate mode raise an error if that undefined value is encountered)~~
12. ~~Make the ValidationView provide a link to download the created study excel doc (with some stats showing the number of created samples, etc.)~~
13. ~~OPTIONAL/TEMPORARY: Make the ValidationView able to (temporarily) generate the LCMS tsv file, provide it for download along with the study doc, and take it as input. This is a temporary issue so that users can revalidate data after implementing fixes. It will be replaced by the ability of the loading scripts to read the LCMS data from the excel study doc (covered in issues #825 and #829)~~
14. ~~Rebrand the Validation Interface into a "Build a Submission" interface~~

See this comment for an updated plan.
I intend to change the above plan. Important notes from slack:
Just as a preface, I think not including mzXMLs in the intermediate solution is copacetic with our overall goal of lowering the hurdles to submission. And while specifying a directory could be easier in some cases, we know that not everyone organizes their data in the same way, so it wouldn't make it easier in every case. We also know that the only time we need the user to go to the effort of mapping mzXMLs to samples in specific accucor files is when there are multiple scans of the same sample - but if the file names are identical in those cases, this interface doesn't solve that (since it can't include the directory).
In fact, I don't think there's a whole lot we can do for the user in the submission compilation process if there are multiple identically named files to be divided among multiple accucors, other than point out that there's no name match. After submission, we may be able to match things up by looking inside the files.
OK, so then I propose the following:
- Remove the mzXML drag and drop feature (which I think we agree on)
- Output an excel file using the current animal/sample template, with samples filled in
- I think it's worthwhile to also allow an animal/sample sheet input (so that samples can be added). I still think that just rebranding the validation page, limiting it to 1 of each file (animal/sample and peak annotation), and removing errors related to this use case (no animal/sample sheet) is the way to go, but if I can't convince you of that, I'll skip this and make a separate page if you insist. I'm just not convinced that that goes in the right direction.
- Make the peak annotation input a single field.
The other alternative I'd proposed earlier, of getting the LCMS loader done first, is moot if we're using the old template.
And later...
I did spend a few minutes BTW just now, polishing off the commit that I was 98% toward yesterday. I'll post a PR, even though some of the decisions above revert aspects of it. I'm also having second thoughts about my recommendation of limiting the interface to 1 of each (of the 2) file type(s), at least in the short term.
There's still the problem of timeouts if we try to validate the data and look for errors, but I realized that, if not given a sample sheet, we could simply not try to look for errors at all, and that would make it fast.
And while I was looking at the code just now, I realized that the reason I got on board with 1 at a time strategy was because of the identically named mzXML issue. By removing that as an obstacle, I don't see any reason we can't take a series of accucor/isocorr and output a list of all the samples in a single sample sheet.
I haven't fully fleshed this idea out, but it's enough to make me want to propose that we at least postpone implementing the limit in the next few commits/PRs. I say, that can be another incremental step, after the next meeting, and after having worked with the code.
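The fast-path idea (skip error-checking entirely when no sample sheet was supplied) could look roughly like this sketch. All names here (`build_submission`, `extract_sample_names`, `validate`) are hypothetical stand-ins for illustration, not actual TraceBase code:

```python
def extract_sample_names(peak_annot_files):
    # Stand-in: pretend each "file" is a dict carrying its sample headers
    return [name for f in peak_annot_files for name in f["samples"]]

def validate(sample_sheet, sample_names):
    # Stand-in for the (slow) full validation pass: flag names absent
    # from the supplied sample sheet
    return [n for n in sample_names if n not in sample_sheet]

def build_submission(peak_annot_files, sample_sheet=None):
    """If no sample sheet was supplied, there is nothing to cross-check,
    so skip validation entirely and just report the extracted samples."""
    sample_names = extract_sample_names(peak_annot_files)
    if sample_sheet is None:
        return {"samples": sample_names, "errors": []}
    return {"samples": sample_names, "errors": validate(sample_sheet, sample_names)}
```

The point of the design is that the no-sheet path never touches the expensive validation code, which is what makes it fast.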
My current thinking is to change the validation page to
Just some notes on creating an excel file in code...
https://stackoverflow.com/a/13437855/2057516
Example:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3]})

# One-shot write:
df.to_excel("test.xlsx", sheet_name="sheet1", index=False)
# Do stuff, e.g.:
# - Add a color to cells: `df.style.applymap(lambda val: "color: %s" % ("red" if val < 0 else "green"))`

# For access to the worksheet object, write through an ExcelWriter instead
# (the sheet only appears in `writer.sheets` after `to_excel` has written it):
writer = pd.ExcelWriter("pandas_simple.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)
worksheet = writer.sheets["Sheet1"]
# Do stuff, e.g.:
# - Add a comment to a cell: `worksheet.write_comment("A1", "This is a comment", {"visible": True})`
# - Add a formula to a cell: `worksheet.write_formula('A1', '=SUM(1, 2, 3)')`
writer.close()
```
Status of features:

- `pd.ExcelWriter`, though they use the xlsxwriter engine. May be better to stick with openpyxl if possible. (xlsxwriter example)
- `df.style.applymap(lambda val: "color: %s" % ("red" if val < 0 else "green"))` (source)

```python
import datetime

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
writer = pd.ExcelWriter("pandas_simple.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1", index=False)
workbook = writer.book
worksheet = writer.sheets["Sheet1"]

# Formats
missing_format = workbook.add_format({"bold": True, "font_color": "#FF0000"})
error_format = workbook.add_format({"bold": True, "font_color": "red"})
date_format = workbook.add_format({"num_format": "mmm d yyyy hh:mm AM/PM"})
cell_format = workbook.add_format({"bold": True})

# Examples:
worksheet.write(0, 0, "Foo", cell_format)
worksheet.write_comment(0, 0, "comment")
worksheet.write_string(1, 0, "Bar", cell_format)
worksheet.write_number(2, 0, 3, cell_format)
worksheet.write_blank(3, 0, "", cell_format)
worksheet.write_formula(4, 0, "=SUM(1, 2, 3)", cell_format)
worksheet.write_datetime(5, 0, datetime.datetime.now(), date_format)
worksheet.write_boolean(6, 0, True, cell_format)
worksheet.write_url(7, 0, "https://example.com")

# Set a whole row/col
worksheet.set_row(0, 18, cell_format)
worksheet.set_column("A:D", 20, cell_format)

# Extra stuff once I get something simple working:
format1 = workbook.add_format({"bg_color": "#FFC7CE"})
worksheet.data_validation("A1", {"validate": "integer",
                                 "criteria": ">",
                                 "value": 100})
worksheet.conditional_format("B3:K12", {"type": "cell",
                                        "criteria": ">=",
                                        "value": 50,
                                        "format": format1})
worksheet.conditional_format("A1:A4", {"type": "errors",
                                       "format": format1})
worksheet.conditional_format("A2:C9",
                             {"type": "formula",
                              "criteria": '=OR($B2<$C2,AND($B2="",$C2>TODAY()))',
                              "format": format1})
writer.close()
```
New plan:

1. Integrate the changes in the `dropzone_paths` branch into the Validation Interface to be able to take an mzXML file list
2. Create a method in the ValidationView to build an LCMS file given the mzXML file name list and an accucor file
3. FINAL OUTPUT FOR ISSUE #888: Provide the LCMS file for download in the ValidationView. The issues that follow this one are to work toward a more polished result
4. OPTIONAL [DO LATER?]: Create a method that takes a sample header list and a list of possible suffixes (`_pos`, `_neg`, `_scan1`, etc.) and returns a dict of accucor sample header to db sample name (called by the method in step 3 above that generates the LCMS file)
5. Remove the mzXML file input
6. Infer accucor/isocorr
7. Change the "Validate" button to a "Download Submission Template" button
8. Catch `NoSamplesError` and `MissingSamplesError` in the ValidationView and add the list of sample names to the exceptions, if not already there
9. Fill in sample names
    - 9.1. If no sample sheet is provided, create dataframes for each sheet; otherwise, retrieve the dataframes of the supplied animal/sample doc
    - 9.2. Catch exceptions containing data (`AllMissingSamplesError`) and make updates/additions to the dataframe
    - 9.3. ~~Create an excel file (given multiple peak annotation files) with sample names filled in. Call it with the sample names extracted from the sample errors. Have it use the method that extracts tracebase sample names from the sample headers.~~ This is unnecessary. The file shouldn't hit disk. Just stream it to the user's downloads.
10. Remove `NoSamplesError`s, `MissingSamplesError`s, and `AllMissingSamplesError`s
11. Make the ValidationView automatically download the created study excel doc (with some stats showing the number of created samples, etc.)
12. Make the errors reported:
    - 12.1. If a sample table was supplied: initially hidden, but expandable underneath some stats, and make any errors handled by autofill into warnings (or eliminate them)
    - 12.2. If a sample table was not supplied: delete any errors associated with added data. If all pass, also delete all of the load statuses and do not present results at all
13. OPTIONAL [DO LATER?]: Make the `create_or_update_study_doc` method able to add placeholders for other required fields (like Animal Name, etc.) and other data. Initially, every sample would be added to an "undefined" or "unknown" animal field that is highlighted for edit (and make the non-validate mode raise an error if that undefined value is encountered)
14. OPTIONAL [DO LATER?]: Time- and/or date-stamp the file name
15. Add a method to shortcut the processing when no animal/sample sheet is supplied, to only grab data to be added
16. OPTIONAL [DO LATER?]: If the study exists, automatically add all existing data to the spreadsheet (I realized in testing that sample names were not added on some example sets I'd tested with, because they'd already been loaded, so they weren't "missing")
17. OPTIONAL [DO LATER?]: Add an errors "sheet" that lists errors unassociated with a sheet/row/column
18. OPTIONAL [DO LATER?]: In addition to "errors" & "warnings" in the validation report, add a new category: either "info" or "additions" or "updates" or "autofills" or "fixes"...? Any of these could change a FAILED status to PASSED. Perhaps even PASS/FAIL can be changed to some sort of readiness category, or a progress status/grid that indicates completeness of: required animal/sample data, suggested animal/sample data, dependent data (tissues/treatments/compounds/etc.), run metadata, and a count of errors/warnings
19. Update the upload page to reflect the rebranding
20. OPTIONAL [DO LATER?]: Add stats to the results: number of errors, records that would be created, data added, etc.

I think that it might be a good idea to focus the validation efforts at the point where they attempt to submit, or perhaps we can create a way to collect components of a submission, piecemeal, and at each addition, associate it with an error report. ... Still thinking this through ...
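Plan step 4 above (mapping accucor sample headers to db sample names by stripping known suffixes) could be sketched like this. The function name and default suffix list are illustrative assumptions, not the actual implementation:

```python
import re

def headers_to_sample_names(headers, suffixes=("_pos", "_neg", "_scan1", "_scan2")):
    """Return a dict mapping each accucor sample header to a candidate db
    sample name, produced by repeatedly stripping trailing known suffixes."""
    # Match one or more of the suffixes, anchored at the end of the header
    pattern = re.compile("(" + "|".join(re.escape(s) for s in suffixes) + ")+$")
    return {header: pattern.sub("", header) for header in headers}
```

For example, `headers_to_sample_names(["liver1_pos"])` maps `"liver1_pos"` to `"liver1"`, while headers without a recognized suffix map to themselves.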
Note that the read excel method (`pd.read_excel`), when you set `sheet_name` to `None`, returns a dict with a dataframe for each sheet.
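A minimal round-trip demonstrating that behavior (this sketch assumes the openpyxl engine is installed; the sheet and column names are made up):

```python
from io import BytesIO

import pandas as pd

# Write a two-sheet workbook to an in-memory buffer
buf = BytesIO()
with pd.ExcelWriter(buf, engine="openpyxl") as writer:
    pd.DataFrame({"Sample Name": ["s1", "s2"]}).to_excel(writer, sheet_name="Samples", index=False)
    pd.DataFrame({"Animal ID": ["a1"]}).to_excel(writer, sheet_name="Animals", index=False)
buf.seek(0)

# sheet_name=None -> a dict keyed on sheet name, one DataFrame per sheet
dfs_dict = pd.read_excel(buf, sheet_name=None)
```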
`create_or_update_study_doc`: I think the thing to do would be to:

- `NoSamplesError`)

Notes on streaming a file... See my notes above on decorating the spreadsheets.
According to the openpyxl documentation, you can stream the excel content by doing the following (which I modified slightly based on other googling):
```python
from tempfile import NamedTemporaryFile
from openpyxl import Workbook

wb = Workbook()
with NamedTemporaryFile(suffix='.xlsx') as tmp:
    wb.save(tmp.name)
    tmp.seek(0)
    stream = tmp.read()
```
Other notes:
```python
# Create sheets
ws1 = wb.create_sheet("Mysheet")     # insert at the end (default)
ws2 = wb.create_sheet("Mysheet", 0)  # insert at first position

# Set sheet name
ws1.title = "New Title"
```
I was a little confused about how I get from the pandas dataframes to the stream, so I found this stack post, which says:
```python
import pandas as pd
from django.http import HttpResponse
from io import BytesIO

excel_file = BytesIO()
xlwriter = pd.ExcelWriter(excel_file, engine='xlsxwriter')

# I modified the answer to put it in terms of my `dfs_dict`, which can have a dataframe keyed on sheets
df_output = {}
for sheet in dfs_dict.keys():
    df_output[sheet] = pd.DataFrame.from_dict(dfs_dict[sheet])
    df_output[sheet].to_excel(xlwriter, sheet_name=sheet)

# `save()` is a deprecated alias for `close()` in pandas, so only one call is needed
xlwriter.close()

# Important step: rewind the buffer, or when it is read() you'll get nothing
# but an error message when you try to open your zero-length file in Excel
excel_file.seek(0)

# Set the mime type so that the browser knows what to do with the file
response = HttpResponse(
    excel_file.read(),
    content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
)

# Set the file name in the Content-Disposition header
response['Content-Disposition'] = 'attachment; filename=myfile.xlsx'
return response
```
So I realized that there's a minor problem. I learned that you can't both `render_to_response` and `HttpResponse(excel_file...)` in one go. I was planning to both kick off the download as a stream (never hitting disk) and still be able to render an error report (if any errors exist) at the same time.
Since that's not possible, unless I figure out a way to preserve the file data (say, embed it in the template), I will have to save a temp file and then use javascript to trigger its download in the resulting page.
I wonder if embedding is feasible, the way I created the download of the lcms file...
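The embedding idea could work roughly like this on the view side. `embed_for_download` and `decode_embedded` are hypothetical helper names, and the data: URL usage in the template is an assumption, not something tested here:

```python
import base64

def embed_for_download(excel_bytes):
    """Base64-encode in-memory xlsx bytes so the string can be passed to the
    render context; the template could then offer it via a data: URL, e.g.
    href="data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,{{ study_doc_b64 }}"
    """
    return base64.b64encode(excel_bytes).decode("ascii")

def decode_embedded(b64_string):
    # What the browser effectively does when the data: URL is followed
    return base64.b64decode(b64_string)
```

This would let a single rendered page carry both the error report and the downloadable file, at the cost of a roughly 4/3 size inflation from base64.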
Nice-to-haves (just documenting ideas - not necessarily adding to the implementation plan)
TODO:
Going to augment the existing DataValidationView as a step toward an end-product, instead of crafting a new class that the view just uses. There is a lot of existing code, and experimentation with what's possible will yield a better overall understanding of the elements involved, which will inform a refactor. It also lets me get the code out faster.
[x] Replace the animals properties with the AnimalsLoader calls (like how the treatments and tissues are handled)
[x] Replace the samples properties with the SamplesLoader calls (like how the treatments and tissues are handled)
Add sheets
Apply header comments to:
Auto-populate sheets
Apply formulas for dynamic drop-down population of:
Apply Static drop-downs for:
Apply dynamically generated static drop-downs (using discrete queries) for:
Apply formulas for column value calculation
Decorate columns
[x] Apply data validation formulas (TBD)
Nice-to-have TODO:
Might be able to:
```python
writer = pd.ExcelWriter(
    "pandas_datetime.xlsx",
    engine='xlsxwriter',
    datetime_format='mmm d yyyy hh:mm:ss',
    date_format='mmmm dd yyyy',
)
```
This issue's remaining items were made into separate issues. I checked off every item that was made into a separate issue. A consolidated list of remaining items was made in #1034 as well (incorporating items from feedback/testing).
FEATURE REQUEST
Inspiration
Issue #753 pretty much describes this (which is its end-goal), but basically, pre-populating as much study data as possible based on accucor files will really speed up compilation of a submission.
Description
Based on my submission process proposal from March (compiled and annotated in my full proposal), I think that we should use the version 3.0 effort as an opportunity to streamline the sheets in the excel file, as I have in my (process) proposal:
Much of the study doc will be pre-populated (see discussion) by the validation interface. All the validation interface will require will be:
Optional additional inputs for validation:
See the interface section in the design below.
Goals:
(e.g. `sample1_pos` being the same as `sample1`)

Alternatives
None
Dependencies
Parent issue-tracking issue:
753
Comment
I created an example version of the Study Excel doc:
study.xlsx
ISSUE OWNER SECTION
Assumptions
None
Limitations
None
Affected Components
validation.py
Requirements
Requirements added to what was copied from #753:
1. Users must be allowed to add compounds, tracers, infusates, tissues, and protocols without involving curator correspondence, with the following caveats:
    - 1.1. New such additions must be flagged for curator review in some way, i.e. the curator must be made to explicitly approve these additions in some way
    - 1.2. The data (all compounds, tracers, etc.) sheets must be removed from the study sheet and added to the corresponding consolidated docs

The requirements from #753:

TODO: Be sure that the requirements above cover the different behaviors in validate mode (output a copy of the study doc that populates sheets like Compound, Treatment/Protocol, Tracer, Infusate, and Tissue with the contents of the database, to serve as an example/guide. It should also do things like add formulas to, for example, create drop-downs for the Compound column on the Tracers sheet.)
DESIGN
Interface Change description
The process for the end user will go like this:
The process for the curator will go like this:
Code Change Description
Create a pre-population method for each field to pre-populate. Keep track of which fields are required. Keep the pre-population methods independent of the loading scripts. Use heuristics for things like chopping `_pos` off sample names. Only unpopulated fields will be auto-populated.

Things like compounds, tissues, and treatments will be populated from existing consolidated data (either adding only data related to what's in the accucor/isocorr file(s), or all data), but after submission, curators will remove this data after updating the consolidated docs to reduce overhead, though researchers will be allowed to hang onto their fully independent study doc.
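The "only unpopulated fields will be auto-populated" rule could be sketched with a hypothetical `autofill_column` helper (the function name, column names, and the "unknown" placeholder are illustrative assumptions):

```python
import pandas as pd

def autofill_column(df, column, default):
    """Fill only the missing cells of one column with a placeholder value,
    leaving user-entered values untouched (works on a copy)."""
    filled = df.copy()
    filled[column] = filled[column].fillna(default)
    return filled
```

For example, autofilling the Animal column with "unknown" would fill in only the samples whose animal was left blank, which could then be highlighted for edit.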
Tests
A test for each requirement