Closed HayleyMills closed 1 month ago
The _tv.txt file is being produced because some rows in the first worksheet have a single space character in the dataset prefix column. That's easy to fix: I'll add a line of code to trim whitespace from those cells.
The mcs_07_ts_s_qv.txt file is empty because none of the rows with the questionnaire prefix 'mcs_07_ts_s' has valid values in all the relevant columns (e.g. the 'dataset prefix' and 'variable name' cells are either empty or contain only whitespace). The code has enough information to create the file, but not enough to populate it with rows.
This touches on an issue I raised in https://github.com/CLOSER-Cohorts/archivist-utilities/issues/42. If the input uploaded to the web application is invalid in some way, do we process it anyway (even if that results in quirks like the empty mcs_07_ts_s_qv.txt file), process it partially (i.e. don't create the mcs_07_ts_s_qv.txt file, but create all the other files that we have enough information to create), or reject it outright as invalid? Rejecting outright would mean that unless everything in the input is right, we don't do anything with it, and if there's an anomaly or error in the input we just display a message informing the user of the error(s).
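To make the three options concrete, here is a minimal Python sketch. It is not the application's code: the function names, the policy labels, the `questionnaire prefix` key, the `mcs_07_cm` prefix and the sample values are all assumptions made for illustration. Only the `mcs_07_ts_s` prefix and the 'dataset prefix' and 'variable name' column names come from this thread.

```python
# Hypothetical sketch of the three ways of handling invalid input.
REQUIRED_COLUMNS = ["dataset prefix", "variable name"]  # column names from this thread


def row_is_valid(row):
    """A row is valid when every required cell holds non-whitespace text."""
    return all(str(row.get(col) or "").strip() for col in REQUIRED_COLUMNS)


def build_qv_files(rows, policy):
    """Map each output file name to its valid rows, according to `policy`.

    policy is one of "process_anyway", "partial" or "reject" (assumed labels).
    """
    grouped = {}
    errors = []
    for index, row in enumerate(rows, start=1):
        prefix = row["questionnaire prefix"]
        grouped.setdefault(prefix, [])  # the file is planned as soon as the prefix is seen
        if row_is_valid(row):
            grouped[prefix].append(row)
        else:
            errors.append(f"row {index}: incomplete row for {prefix!r}")

    if policy == "reject" and errors:
        # Nothing is produced; the user only sees the error list.
        raise ValueError("; ".join(errors))
    if policy == "partial":
        # Skip files that would have no rows, keep everything else.
        grouped = {p: valid for p, valid in grouped.items() if valid}
    # "process_anyway" keeps empty files such as mcs_07_ts_s_qv.txt.
    return {f"{prefix}_qv.txt": valid for prefix, valid in grouped.items()}


rows = [
    {"questionnaire prefix": "mcs_07_ts_s", "dataset prefix": " ", "variable name": ""},
    {"questionnaire prefix": "mcs_07_cm", "dataset prefix": "mcs4", "variable name": "age"},
]

print(sorted(build_qv_files(rows, "process_anyway")))
print(sorted(build_qv_files(rows, "partial")))
```

With the "reject" policy the same input raises a `ValueError` listing the offending rows, which is where a user-facing error message would come from.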
@ollylucl thanks, that all makes sense, and we can make a decision about this. The reason I'm confused is that I thought "Any rows which do not contain all the columns listed below will not be present in the output." and "Any rows with cells containing 'NA' or 'Derived', or cells that are empty, will not be present in the output, as they do not contain sufficient information." Am I misunderstanding this?
I've dealt with this issue in https://github.com/CLOSER-Cohorts/archivist-utilities/pull/49.
With regard to this comment:
@ollylucl thanks, that all makes sense, and we can make a decision about this. The reason I'm confused is that I thought "Any rows which do not contain all the columns listed below will not be present in the output." and "Any rows with cells containing 'NA' or 'Derived', or cells that are empty, will not be present in the output, as they do not contain sufficient information." Am I misunderstanding this?
...the README accurately describes the behaviour we want, but the code wasn't trimming/stripping whitespace from spreadsheet cells. When cells contained whitespace (e.g. a single space) it caused the sort of odd behaviour described in the first post of this issue. A cell that contains nothing but whitespace should be treated the same as an empty cell.
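The rule can be sketched roughly as follows. This is an illustration in Python and not the code from the pull request: the helper names and sample rows are made up, and treating 'NA' and 'Derived' as case-sensitive exact matches is an assumption.

```python
# Hypothetical sketch: strip each cell before deciding whether a row is usable.
SKIP_VALUES = {"", "NA", "Derived"}  # values the README says exclude a row


def normalise_cell(value):
    """Return the cell as a stripped string; a missing cell becomes ''."""
    if value is None:
        return ""
    return str(value).strip()


def is_usable_row(row, required_columns):
    """True when no required cell is empty, whitespace-only, 'NA' or 'Derived'."""
    return all(
        normalise_cell(row.get(col)) not in SKIP_VALUES for col in required_columns
    )


required = ["dataset prefix", "variable name"]
rows = [
    {"dataset prefix": "mcs4", "variable name": "age"},  # kept
    {"dataset prefix": " ", "variable name": "sex"},     # single space: dropped
    {"dataset prefix": "mcs4", "variable name": "NA"},   # 'NA': dropped
    {"dataset prefix": "mcs4", "variable name": None},   # empty: dropped
]

usable = [row for row in rows if is_usable_row(row, required)]
print(usable)
```

Without the `strip()` call, the second row's single space would count as a real dataset prefix, which is the kind of input that led to the oddly named _tv.txt file.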
I'll close this issue, but feel free to re-open it if you think we need to do more to address the issue.
I'm confused as to the circumstances under which an empty tv.txt would be produced.
mcs4_all_mappings_July24_AP_hm.xlsx __tv.txt mcs_07_ts_s_qv.txt