GenSpectrum / LAPIS-SILO

Sequence Indexing engine for Large Order of genomic data
GNU Affero General Public License v3.0

Quotes in field names cause preprocessing errors #595

Open tsibley opened 4 days ago

tsibley commented 4 days ago

Field names are not properly quoted when generating the queries for DuckDB. They're surrounded by double quotes without escaping (by doubling) any double quotes already in the name.

https://github.com/GenSpectrum/LAPIS-SILO/blob/fb58525de5d978075defbe4a22d5a3f37dbb6d44/src/silo/preprocessing/metadata_info.cpp#L160-L191

It looks like the code above should be using Identifier::escapeIdentifier() instead, although I suspect doing so will require other downstream changes in calling code.
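
For illustration, the doubling rule described above can be sketched as follows. This is a minimal sketch, not the project's code: `quoteIdentifier` and `createTableStatement` are hypothetical helper names made up for this example, and the actual behavior and signature of `Identifier::escapeIdentifier()` may differ.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (an assumption for illustration, not SILO's
// Identifier::escapeIdentifier()): wraps a name in double quotes and
// doubles every double quote already present in the name.
std::string quoteIdentifier(std::string_view name) {
   std::string quoted;
   quoted.reserve(name.size() + 2);
   quoted.push_back('"');
   for (const char character : name) {
      if (character == '"') {
         quoted.push_back('"');  // emit the extra quote that escapes the next one
      }
      quoted.push_back(character);
   }
   quoted.push_back('"');
   return quoted;
}

// Hypothetical helper: builds a CREATE TABLE statement in which every
// field becomes a VARCHAR column with a safely quoted name.
std::string createTableStatement(
   std::string_view table,
   const std::vector<std::string>& fields
) {
   std::string sql = "CREATE TABLE " + quoteIdentifier(table) + "(";
   for (std::size_t i = 0; i < fields.size(); ++i) {
      if (i > 0) {
         sql += ",";
      }
      sql += quoteIdentifier(fields[i]) + " VARCHAR";
   }
   sql += ");";
   return sql;
}
```

With this approach, a field named `x"y` would be emitted as `"x""y"`, so the statement from the error below would no longer contain an unterminated quoted identifier.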

I noticed this when reading the preprocessing code and then verified it with the attached test input directory and the following command:

$ docker run --rm -v "$PWD"/testBaseData/tsvWithQuoteInFieldName:/preprocessing/input:ro -v "$PWD"/testBaseData/output:/preprocessing/output:rw -v "$PWD"/testBaseData/tsvWithQuoteInFieldName/preprocessing_config.yaml:/app/preprocessing_config.yaml:ro -v "$PWD"/testBaseData/tsvWithQuoteInFieldName/database_config.yaml:/app/database_config.yaml:ro -v "$PWD"/testBaseData/logs:/app/logs:rw ghcr.io/genspectrum/lapis-silo:0.2.21 --preprocessing
[2024-09-24 18:13:24.355] [logger] [info] [api.cpp:316] Starting SILO
[2024-09-24 18:13:24.356] [logger] [info] [api.cpp:274] Starting SILO preprocessing
[2024-09-24 18:13:24.356] [logger] [info] [yaml_file.cpp:11] Reading config from ./default_preprocessing_config.yaml
[2024-09-24 18:13:24.356] [logger] [info] [yaml_file.cpp:11] Reading config from ./preprocessing_config.yaml
[2024-09-24 18:13:24.356] [logger] [info] [api.cpp:72] Resulting preprocessing config: { input directory: '/preprocessing/input/', pango_lineage_definition_file: none, output_directory: '/preprocessing/output/', metadata_file: ''metadata.tsv'', reference_genome_file: 'reference_genomes.json',  gene_file_prefix: 'gene_',  nucleotide_sequence_file_prefix: 'nuc_', unaligned_nucleotide_sequence_file_prefix: 'unaligned_', ndjson_filename: none, preprocessing_database_location: none }
[2024-09-24 18:13:24.356] [logger] [info] [database_config.cpp:211] Reading database config from database_config.yaml
[2024-09-24 18:13:24.356] [logger] [info] [api.cpp:257] preprocessing - reading reference genome
[2024-09-24 18:13:24.356] [logger] [info] [reference_genomes.cpp:159] Read reference genomes from file: /preprocessing/input/reference_genomes.json
[2024-09-24 18:13:24.356] [logger] [info] [api.cpp:261] preprocessing - reading pango lineage alias
[2024-09-24 18:13:24.356] [logger] [info] [pango_lineage_alias.cpp:106] No pango lineage alias file provided. Using empty alias lookup.
[2024-09-24 18:13:24.363] [logger] [info] [preprocessor.cpp:85] preprocessing - creating intermediate results directory '/preprocessing/temp/'
[2024-09-24 18:13:24.363] [logger] [info] [preprocessor.cpp:115] preprocessing - classic metadata file pipeline chosen
[2024-09-24 18:13:24.364] [logger] [error] [api.cpp:282] Parser Error: unterminated quoted identifier at or near "" VARCHAR);"
LINE 1: ... TABLE metadata_table("id" VARCHAR,"x"y" VARCHAR);
                                                  ^
[2024-09-24 18:13:24.364] [logger] [info] [api.cpp:321] Stopping SILO
std::exception

Taepper commented 4 days ago

Thank you for raising this issue! We really should have had this covered by test cases already.