Open johrstrom opened 2 years ago
This file gets filtered down from 1.7 TB to 2.5 GB, which means we only use about 0.15% of the GBIF data.
2.5G /fs/project/PAS1604/gbif/0147211-200613084148143.filtered.txt
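As a quick sanity check on the ratio between those two sizes, here is a small sketch. It assumes decimal units (1 TB = 1000 GB), and `percent_kept` is just an illustrative helper name, not something from the repo.

```python
def percent_kept(filtered_gb: float, original_tb: float) -> float:
    """Return the filtered size as a percentage of the original size.

    Assumes decimal units, i.e. 1 TB = 1000 GB.
    """
    original_gb = original_tb * 1000.0
    return filtered_gb / original_gb * 100.0


# Sizes quoted in this ticket: 2.5 GB kept out of 1.7 TB.
print(f"{percent_kept(2.5, 1.7):.3f}%")  # prints "0.147%"
```

So the filtered file is roughly 0.15% of the original by size. The share of records kept could differ if the filter also drops columns.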
This is handled in this file (or some variation of the file as I'm about to rename it).
This ticket will now be to document this filtering.
I'm linking GBIF data, and the 1.7 TB file doesn't seem to contain anything that gets through a regex for the id.
I see `filtered` in the name of some files around here, but I don't know how they got filtered.
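Until the real filtering is documented, here is a minimal sketch of what an id-regex filter over a file this size might look like. Everything specific in it is an assumption for illustration: the tab separator, the id living in the first column, the `\d+` pattern, and the names `filter_rows` and `filter_file`. None of that is confirmed by this ticket.

```python
import re
from typing import Iterable, Iterator

# Hypothetical values: the actual pattern and id column are not stated here.
ID_PATTERN = re.compile(r"\d+")
ID_COLUMN = 0


def filter_rows(
    lines: Iterable[str],
    pattern: "re.Pattern[str]" = ID_PATTERN,
    column: int = ID_COLUMN,
    sep: str = "\t",
) -> Iterator[str]:
    """Yield only the lines whose id column fully matches the pattern."""
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if column < len(fields) and pattern.fullmatch(fields[column]):
            yield line


def filter_file(src: str, dst: str) -> None:
    """Stream src to dst line by line so the whole file never sits in memory."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(filter_rows(fin))
```

Streaming line by line matters here because the input is far larger than memory. The real script may well filter on different columns or criteria.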