Rather than using taxID_to_GIs.pl to create files of GI subsets for every family under a higher-level taxon like 8948, use taxID_to_GIs.pl only to get the taxon identifiers of all families, then feed these identifiers one at a time to subset_mito_db.pl to create mitochondrial and nuclear sub-databases for every family under the higher-level taxon ID.
This makes part of the functionality of taxID_to_GIs.pl redundant: we no longer need it to create GI lists at all, since we get the GI lists from Entrez. With this method we ignore the data in gi_taxid_nucl.dmp.gz.
My concern with this new method is that we make many more calls to Entrez over the network (3 for each family: one for nuclear, one for mitochondrion (not full genome), and one for mitochondrion (full genome)), which might not be as robust. I think the proposed method is more straightforward and simple, however. Worth a shot.
I will be interested to see which method is faster (I'm guessing the first, i.e. the original GI-file approach).
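To put a rough number on the extra Entrez traffic, here is a minimal Python sketch that only plans the calls (it does not contact Entrez or run subset_mito_db.pl). Everything in it is an assumption for illustration: the function names, the placeholder family IDs, and especially the query strings, which are my guess at plausible Entrez terms and not what subset_mito_db.pl actually sends. The 3-requests-per-second figure is the limit I understand NCBI to apply to clients without an API key.

```python
# Hypothetical sketch: plan the per-family Entrez queries and estimate
# a lower bound on run time. Nothing here touches the network.

# ASSUMPTION: these query strings are illustrative guesses, not the
# queries subset_mito_db.pl really issues.
QUERY_TEMPLATES = {
    "nuclear": "txid{taxid}[Organism:exp] NOT mitochondrion[Filter]",
    "mito_partial": "txid{taxid}[Organism:exp] AND mitochondrion[Filter] "
                    "NOT complete genome[Title]",
    "mito_full": "txid{taxid}[Organism:exp] AND mitochondrion[Filter] "
                 "AND complete genome[Title]",
}


def plan_queries(family_taxids):
    """Return (taxid, subset_name, query) tuples, three per family."""
    plan = []
    for taxid in family_taxids:
        for name, template in QUERY_TEMPLATES.items():
            plan.append((taxid, name, template.format(taxid=taxid)))
    return plan


def min_seconds(n_calls, calls_per_second=3):
    """Lower bound on wall time if calls are throttled to a fixed rate.

    ASSUMPTION: 3 calls/second, the rate I believe NCBI allows
    without an API key.
    """
    return n_calls / calls_per_second


if __name__ == "__main__":
    # Placeholder taxon IDs, NOT real families under 8948.
    families = [1001, 1002, 1003]
    plan = plan_queries(families)
    print(len(plan), "Entrez calls planned")
    print("at least", min_seconds(len(plan)), "seconds when throttled")
    for taxid, name, query in plan:
        print(taxid, name, query)
```

The point is just that the call count grows as 3 x (number of families), so throttling alone sets a floor on the run time before any download or database-building time is counted. That is part of why I expect the first method to be faster.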