DataBiosphere / data-explorer

Search doesn't seem to work well with numeric values #374

Open wnojopra opened 4 years ago

wnojopra commented 4 years ago

For UKB ITT's synthetic data, we tested indexing each genotype as a field, using its RSID as the field name. The field names looked like rs123, rs1234, rs12345, etc. Searching for 'rs' returned the list of genotypes, but searching for 'rs123' didn't seem to work. Changing the field names to include underscores didn't seem to help either (I think the analyzer tokenizes on characters like underscores and spaces, but not on numbers).

This is also visible in 1000 Genomes. Searching for 'chr' and 'vcf' returns values like chr_1_vcf, chr_16_vcf, etc., but searching for 'chr 1' returns nothing.
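
To make the expected behavior concrete, here is a minimal Python sketch. It is not data-explorer's actual search code. It assumes a hypothetical analyzer that lowercases text, splits only on underscores and whitespace (so digits stay attached to letters), and treats a field as a match when every query token is a prefix of some field token. The names `tokenize` and `matches` are made up for illustration.

```python
import re


def tokenize(text):
    """Hypothetical analyzer: lowercase, then split on underscores and whitespace only."""
    return [tok for tok in re.split(r"[_\s]+", text.lower()) if tok]


def matches(query, field_name):
    """True if every query token is a prefix of at least one field-name token."""
    field_tokens = tokenize(field_name)
    return all(
        any(field_tok.startswith(query_tok) for field_tok in field_tokens)
        for query_tok in tokenize(query)
    )


# Field names taken from the examples in this issue.
fields = ["chr_1_vcf", "chr_16_vcf", "rs123", "rs1234", "rs12345"]

for query in ["rs", "rs123", "chr", "chr 1", "chr_1"]:
    hits = [name for name in fields if matches(query, name)]
    print(f"{query!r} -> {hits}")
```

Under these assumptions, 'rs123' would match rs123, rs1234, and rs12345, and 'chr 1' would match the same fields as 'chr_1'. The observed behavior differs from this, which suggests the real analyzer or query type handles digits or token boundaries differently.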

melissachang commented 4 years ago

chr_1 works with 1000 Genomes.

Is there a UKB demo that demonstrates the issue, since it doesn't appear to be an issue with 1000 Genomes?

wnojopra commented 4 years ago

What about chr 1 (a space, not an underscore)? I would expect that to work.

Sorry, I needed to take down the UKB demo explorer. It was fairly expensive to keep up.

But actually, this issue is visible in biobank-explorer. If you click the dropdown, u100040* facets are among the first to populate in the list. But if you do a search for one of these facets, say u10004_0_0, or even u10004, it doesn't show up.