RConsortium / submissions-wg

R Submissions Working Group
https://rconsortium.github.io/submissions-wg

Proposal for the Software Programs section 4.1.2.10 #38

Open waddella opened 3 years ago

waddella commented 3 years ago

I am copying over from:

https://github.com/RConsortium/submissions-wg/discussions/7

Motivation

Let's start the discussion about an alternative wording proposal for section 4.1.2.10 in the "FDA STUDY DATA TECHNICAL CONFORMANCE GUIDE" found here:

https://www.fda.gov/media/143550/download

Currently (November 2020) the paragraph reads:

Sponsors should provide the software programs used to create all ADaM datasets and generate tables and figures associated with primary and secondary efficacy analyses. Furthermore, sponsors should submit software programs used to generate additional information included in Section 14 CLINICAL STUDIES of the Prescribing Information, if applicable. The specific software utilized should be specified in the ADRG. Refer to FDA Statistical Software Clarifying Statement for more information.

The main purpose of requesting the submission of these programs is to understand the process by which the variables for the respective analyses were created and to confirm the analysis algorithms and results. Sponsors should submit software programs in ASCII text format. Executable file extensions should not be used.

The goal is to be more specific about the file types that can be submitted, as many have interpreted that paragraph to mean that program files must be submitted as .txt files.

Am I correct that the actual constraint is that the files need to be:

Also would .tar be acceptable for a collection of ASCII files?

Let's start the discussion here, I can update this first message with the agreed solution from the discussion thread.

lengning commented 3 years ago

duplicate of #3 ? closed #3

dgkf commented 3 years ago

Recap of discussion on 2021/09/03

From the discussion today, it sounds like the goal is that code in the file is not automatically executed upon opening (specifically within the context of Windows as it's configured on the Reviewer's machine).

This gets a little tricky, since extensions are nothing more than cues to the operating system to handle a file in some specific way, and that behavior is often configured in the user space of the operating system (using the "Default Apps" menu). If the default app is a shell (Windows command line, PowerShell, or even an R shell), then these files would be executed automatically despite being ASCII text format files.

Based on our discussion today, it sounds like the HA reviewer machines have been configured to disallow software from automatically executing code based on an extension, but that .bat files might be an exception, executing by default in the Windows command line. This might be a notable exception to the ASCII text format rules.

It sounds like these are an exception to the more generalized ASCII guidance that we should include, and it would be great if any exceptions are narrowly defined. From the HA, it would be nice if we could get a list of file extensions which they would like to black-list (if there are any others aside from .bat).

Proposed Refinement

On Linux systems I think it might be easier to draw this line based on the file's executable flag.
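
For illustration, a minimal sketch of what a check on the executable flag could look like on a POSIX system. The helper name `has_executable_bit` is made up for this example and is not part of any agreed process:

```python
import os
import stat


def has_executable_bit(path: str) -> bool:
    """Return True if any owner/group/other execute bit is set on `path`.

    Hypothetical helper for illustration only; meaningful on POSIX systems.
    """
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
```

A plain-text R script saved with mode 644 would pass such a check, while the same file with mode 755 would be flagged, regardless of its extension.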

Would it be sufficient to say that this ASCII text recommendation should exclude files with extensions that are configured to launch and run in a shell on a fresh install of Windows? I think this encompasses .bat, and probably PowerShell (.ps1, .ps2 + a few others) as well, but I would need to get hold of a Windows machine to be exhaustive.
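
As a rough sketch of that rule, an extension check against a deny-list could look like the following. The set below contains only the extensions named in this thread and is an assumption, not an exhaustive or HA-approved list; the names `SHELL_LAUNCHED_EXTENSIONS` and `is_denied_extension` are hypothetical:

```python
from pathlib import PurePath

# Assumed deny-list for illustration only: just the extensions mentioned in
# this thread. The real list would have to come from the HA.
SHELL_LAUNCHED_EXTENSIONS = {".bat", ".ps1"}


def is_denied_extension(filename: str) -> bool:
    """Return True if the final extension (case-insensitive) is on the deny-list."""
    return PurePath(filename).suffix.lower() in SHELL_LAUNCHED_EXTENSIONS
```

The comparison is case-insensitive because Windows treats `run.BAT` and `run.bat` the same way. Only the final suffix is inspected, so `archive.tar.gz` is checked as `.gz`.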

vmarenny commented 2 years ago

The ASCII requirement should probably be replaced with "ASCII or UTF-8". Most modern languages, R and Python 3 included, default to UTF-8 when saving files.
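
To make the distinction concrete, here is a small sketch that classifies a file's bytes as ASCII, UTF-8, or neither. The function name `classify_text_encoding` is hypothetical. Note that it only checks decodability, so control characters such as NUL still count as ASCII here:

```python
def classify_text_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' depending on how `data` decodes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"
```

Every ASCII file is also valid UTF-8, so accepting "ASCII or UTF-8" only widens the rule to files that contain at least one multi-byte character.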

As for .tar as a collection of files, it opens up a nice opportunity for containers (Docker), as images can be saved and loaded as tarballs.

kaz-gene-com commented 2 years ago

Note that they explicitly call out define.pdf, and every example of it I've ever seen is UTF-8. Elsewhere in the same document it notes:

3.3.5 Special Characters: Variables and Datasets

Variable names, as well as variable and dataset labels should include American Standard Code for Information Interchange (ASCII) text codes only. Variable values are the most broadly compatible with software and operating systems when they are restricted to ASCII text codes (printable values below 128). Use UTF-8 for extending character sets; however, the use of extended mappings is not recommended. Transcoding errors, variable length errors, and lack of software support for multi byte UTF-8 encodings can result in incorrect character display and variable value truncations. Ensure that LBSTRESC and controlled terminology extensions in LBTEST do not contain byte values 160-191 as some character mappings in that range may interfere with agency processes.

emphasis mine

Thus I think the idea is to avoid UTF-8 where possible and NOT to use it in variable names or in the LBSTRESC/LBTEST codelists.
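
As a sketch of what those two constraints mean at the byte level (function names are hypothetical, and "printable ASCII" is taken here to mean byte values 32 through 126):

```python
def non_printable_ascii_positions(value: bytes) -> list:
    """Return the positions of bytes outside printable ASCII (32..126)."""
    return [i for i, b in enumerate(value) if not 32 <= b <= 126]


def has_bytes_160_to_191(value: bytes) -> bool:
    """Return True if any byte falls in the 160-191 range called out in 3.3.5."""
    return any(160 <= b <= 191 for b in value)
```

For example, a unit such as "µg/L" encoded as UTF-8 starts with the bytes 194 and 181, so it fails the printable-ASCII check and also contains a byte in the 160-191 range, whereas "mg/dL" passes both.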

A few other points about "binaries". The FDA accepts many binary file formats, the major one being PDF (which often uses zip-type compression internally). XPT files can contain floating-point values in IEEE (or IBM) format, big endian or little endian, and that's definitely not ASCII text.

I suspect another reason UTF-8 is discouraged is fonts: the machines have a limited set of installed fonts (Times New Roman, Arial, Courier New, Symbol, and Zapf Dingbats; see the list in https://www.fda.gov/media/76797/download), so PDFs need to embed fonts for anything but the approved ones.

In any event I think it's a conversation with a reviewer/CBER (or is it CDER?) IT.

dgkf commented 2 years ago

Thanks @vmarenny & @kaz-gene-com

The .tar option and the discussion about compressed PDF contents might open a door for us to consider .tar.gz/.tgz/.zip, which are standard formats for distributing R packages (on Unix/Mac/Windows respectively).

Whether these are permitted as source code for a submission is one question (as they can be decompressed and extracted as plain-text files), and whether these files would be permissible in the submission portal is another open question. We discussed this option at today's meeting and plan to use the eCTD testing portal to get feedback on the viability of submitting a .tar.gz/.tgz/.zip R package source code bundle.

This would make it much easier to install these packages on the receiving end of a submission, allowing a reviewer to simply use install.packages instead of having to first build the packages themselves.
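
To illustrate the point that such a bundle is a binary container around plain-text files, here is a minimal round-trip sketch using Python's standard `tarfile` module. It is not how R builds packages; the function names and file names are hypothetical, and the example assumes every file is pure ASCII:

```python
import io
import tarfile


def bundle_text_files(files: dict) -> bytes:
    """Pack {name: text} into an in-memory .tar.gz and return its bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            payload = text.encode("ascii")  # raises if the text is not pure ASCII
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def unbundle_text_files(blob: bytes) -> dict:
    """Read every regular file in a .tar.gz back as {name: text}."""
    result = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                content = archive.extractfile(member).read()
                result[member.name] = content.decode("ascii")
    return result
```

The bundle itself begins with the gzip magic bytes and so is not ASCII text, but everything extracted from it is, which is why the two questions above (acceptable as source code, acceptable to the portal) are separate.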

sclewis23 commented 2 years ago

So is it acceptable to use R source bundles such as .tar.gz in submissions? Or do they have to be ASCII files?