information we need in nodes.dmp: 1st, 2nd, and 3rd columns
1st column: taxonomic node id
2nd column: parent of the node
3rd column: taxonomic rank
information we need in names.dmp: 1st, 2nd, and 4th columns
1st column: taxonomic node id
2nd column: node name
4th column: name class; only use the rows marked "scientific name", ignore the rest (the 3rd column holds the unique name, which we don't need)
Here are some notes as we didn't have much time to talk about this.
download taxdump.tar.gz or taxdump.zip from: ftp://ftp.ncbi.nih.gov/pub/taxonomy
files we need: names.dmp and nodes.dmp
information we need in nodes.dmp: 1st, 2nd, and 3rd columns
information we need in names.dmp: 1st, 2nd, and 4th columns (the 4th is the name class; only use the rows marked "scientific name")
the files are sorted by node id, not hierarchically. This means a node that is lower in the tree can appear before its parent, or even before its higher-level ancestors. You need to account for that when parsing: don't assume the list runs from the top of the tree to the bottom.
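One way to cope with the ordering, sketched under the assumption that all parent pointers fit in memory: read every row into a dictionary first, and only walk the hierarchy afterwards. The function name `lineage` and the sample ids are made up for illustration; the only fact taken from the real data is that the root node lists itself as its parent.

```python
def lineage(node_id, parents):
    """Return the ids from node_id up to the root (the root is its own parent)."""
    path = [node_id]
    while parents[node_id] != node_id:
        node_id = parents[node_id]
        if node_id in path:  # guard against a malformed, cyclic file
            raise ValueError(f"cycle detected at node {node_id}")
        path.append(node_id)
    return path


# Rows deliberately listed child-first to mimic a file that is not in
# tree order: (node id, parent id). The ids are hypothetical.
rows = [(5, 9), (9, 2), (2, 1), (1, 1)]

# Pass 1: collect all parent pointers without assuming any order.
parents = {}
for node, parent in rows:
    parents[node] = parent

# Pass 2: resolve the hierarchy once every node is known.
print(lineage(5, parents))  # prints [5, 9, 2, 1]
```

Because nothing is resolved until the whole file has been read, it does not matter where a parent appears relative to its children.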
don't push these files to the repository, as they are very big (I think GitHub won't allow it anyway, or if everyone pulls we'll go over the monthly data usage limit). We need another way to store them later, or to download them in the background.
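One possible way to keep the dumps out of the repository, sketched with the standard library only: download the archive once into a local cache directory (which would be listed in .gitignore) and unpack just the two files we need. The function name `ensure_taxdump`, the cache directory, the full archive URL (the directory from these notes plus `taxdump.tar.gz`), and the assumption that the two files sit at the top level of the archive are all guesses to be checked.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# Assumed: the directory from the notes plus the archive name.
TAXDUMP_URL = "ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"
WANTED = ("names.dmp", "nodes.dmp")


def ensure_taxdump(cache_dir, url=TAXDUMP_URL):
    """Return paths to names.dmp and nodes.dmp, downloading only if missing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    targets = [cache_dir / name for name in WANTED]
    if all(path.exists() for path in targets):
        return targets  # already cached, no network access needed

    archive = cache_dir / "taxdump.tar.gz"
    urllib.request.urlretrieve(url, str(archive))
    with tarfile.open(archive, "r:gz") as tar:
        for name, target in zip(WANTED, targets):
            # Assumes the members are stored without a directory prefix.
            with tar.extractfile(tar.getmember(name)) as src:
                with open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)  # stream, the files are large
    return targets
```

Calling `ensure_taxdump("data/taxdump")` a second time returns immediately, since both files are already in the cache.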
the file format is unfortunately very ugly... The fields are separated by a tab, a pipe character, and another tab, and every line also ends with a tab and a pipe.
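A minimal sketch of how such lines could be split, assuming fields are joined by tab-pipe-tab and each line is closed by tab-pipe. The names `split_dmp_line`, `read_nodes`, and `read_names` are hypothetical, and the sketch assumes the name class sits in the fourth field of names.dmp (after the unique-name field).

```python
def split_dmp_line(line):
    """Split one .dmp line into its fields, dropping the trailing tab-pipe."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def read_nodes(lines):
    """Return {node_id: (parent_id, rank)} from an iterable of nodes.dmp lines."""
    nodes = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = split_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def read_names(lines):
    """Return {node_id: name}, keeping only rows whose class is 'scientific name'."""
    names = {}
    for line in lines:
        if not line.strip():
            continue
        fields = split_dmp_line(line)
        # Assumed layout: id, name, unique name, name class.
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names
```

Both readers accept any iterable of lines, so they work on an open file handle as well as on a list of strings.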
tests: use a small subset of the files to test, or write a custom string that follows the same format.
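A sketch of the second option: a hand-written string in the dump layout, fed to the parser through `io.StringIO`. The sample rows are illustrative only, and `parse_rows` is a stand-in for whatever parser we end up writing.

```python
import io
import unittest

# Hand-written sample: fields joined by tab-pipe-tab, each line closed by
# tab-pipe. Ids and ranks are for illustration only.
SAMPLE_NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
    "131567\t|\t1\t|\tno rank\t|\n"
)


def parse_rows(handle):
    """Stand-in parser: yield the list of fields for each non-empty line."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.endswith("\t|"):
            line = line[:-2]
        yield line.split("\t|\t")


class ParseRowsTest(unittest.TestCase):
    def test_child_may_precede_parent(self):
        rows = list(parse_rows(io.StringIO(SAMPLE_NODES)))
        ids = [int(row[0]) for row in rows]
        # Node 2 is listed before its parent 131567.
        self.assertLess(ids.index(2), ids.index(131567))
        self.assertEqual(rows[1], ["2", "131567", "superkingdom"])
```

The test case can be run with `python -m unittest`; it needs neither the real files nor network access.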
Please ask questions or drop by my office if you have any problems.