In database building procedure, R dataframes are output into CSV files, and the CSV files are loaded into a PostgreSQL database using COPY FROM SQL command.
The R output CSV format is not completely compatible with PostgreSQL COPY FROM command.
The following case is known to be incompatible:
\.
COPY FROM: "Because backslash is not a special character in the CSV format, ., the end-of-data marker, could also appear as a data value."
readr::write_csv does not quote \. in output CSV.
The following cases need to be further investigated for compatibility:
NULL/NA/missing values
COPY FROM: "The CSV format has no standard way to distinguish a NULL value from an empty string. PostgreSQL's COPY handles this by quoting. A NULL is output as the NULL parameter string and is not quoted, while a non-NULL value matching the NULL parameter string is quoted. For example, with the default settings, a NULL is written as an unquoted empty string, while an empty string data value is written with double quotes (""). Reading values follows similar rules. You can use FORCE_NOT_NULL to prevent NULL input comparisons for specific columns."
readr::write_csv probably writes both NA and empty string as blank in output CSV, with na = "".
Values that are not delimiter/quote/CR/LF/NULL/NA/missing but contain them:
COPY FROM: "If the value contains the delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, then the whole value is prefixed and suffixed by the QUOTE character, and any occurrence within the value of a QUOTE character or the ESCAPE character is preceded by the escape character."
readr::write_csv probably quotes such values in CSV output, but edge cases need to be evaluated.
Currently, PostgreSQL COPY FROM compatible CSV files are output by the following function.
The values in ${BULK_EXP_SCHEMA}_${BULK_EXP_TPM_HISTOLOGY_TBL}.csv, which is the only CSV file as of 09/17/2021, are all compatible with PostgreSQL COPY FROM. However, CSV files that will be added in #37 may contain incompatible values.
In database building procedure, R dataframes are output into CSV files, and the CSV files are loaded into a PostgreSQL database using
COPY FROM
SQL command.The R output CSV format is not completely compatible with PostgreSQL
COPY FROM
command.The following case is known to be incompatible:
\.
COPY FROM
: "Because backslash is not a special character in the CSV format, ., the end-of-data marker, could also appear as a data value."readr::write_csv
does not quote\.
in output CSV.The following cases need to be further investigated for compatibility:
NULL
/NA
/missing valuesCOPY FROM
: "The CSV format has no standard way to distinguish a NULL value from an empty string. PostgreSQL's COPY handles this by quoting. A NULL is output as the NULL parameter string and is not quoted, while a non-NULL value matching the NULL parameter string is quoted. For example, with the default settings, a NULL is written as an unquoted empty string, while an empty string data value is written with double quotes (""). Reading values follows similar rules. You can use FORCE_NOT_NULL to prevent NULL input comparisons for specific columns."readr::write_csv
probably writes bothNA
and empty string as blank in output CSV, withna = ""
.NULL
/NA
/missing but contain them:COPY FROM
: "If the value contains the delimiter character, the QUOTE character, the NULL string, a carriage return, or line feed character, then the whole value is prefixed and suffixed by the QUOTE character, and any occurrence within the value of a QUOTE character or the ESCAPE character is preceded by the escape character."readr::write_csv
probably quotes such values in CSV output, but edge cases need to be evaluated.Currently, PostgreSQL
COPY FROM
compatible CSV files are output by the following function.https://github.com/PediatricOpenTargets/OpenPedCan-api/blob/c165d95cd4826ef372c2f1267d0e9e0f8f1f030f/db/build_tools/build_db.R#L153-L157
The values in
${BULK_EXP_SCHEMA}_${BULK_EXP_TPM_HISTOLOGY_TBL}.csv
, which is the only CSV file as of 09/17/2021, are all compatible with PostgreSQLCOPY FROM
. However, CSV files that will be added in #37 may contain incompatible values.Refs:
COPY FROM
specifications: https://www.postgresql.org/docs/11/sql-copy.htmlreadr::write_csv
latest version specifications: https://readr.tidyverse.org/reference/write_delim.html. The local database building Docker image usesreadr
version 2.0.0.