Open heathercole opened 3 years ago
For DAO, the primary identifier is the barcode number (this truly internally unique number is to be considered the catalogue number). We must also record historic DAO accession numbers (these numbers are not guaranteed to be internally unique, but for most of the herbarium's history served as the catalogue number).
For DAOM, the primary identifier is a truly internally unique incrementing catalogue number assigned by collections staff. The barcode (also truly internally unique) is a secondary identifier. I want to investigate making the barcode the primary identifier in the future for new acquisitions.
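For the DAO case described above, one way to honour both kinds of number is to key records on the barcode and treat historic accession numbers as a secondary index that may return more than one record. This is a minimal Python sketch only: `HerbariumIndex`, its method names, and the example barcode and accession values are hypothetical, and it is not DINA's data model.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class HerbariumIndex:
    """Barcode is the unique key; accession numbers are allowed to repeat."""

    def __init__(self) -> None:
        self._by_barcode: Dict[str, dict] = {}
        self._by_accession: Dict[str, List[str]] = defaultdict(list)

    def add(self, barcode: str, accession: Optional[str] = None) -> None:
        # Uniqueness is enforced on the barcode only.
        if barcode in self._by_barcode:
            raise ValueError(f"barcode {barcode!r} is already in use")
        self._by_barcode[barcode] = {"barcode": barcode, "accession": accession}
        if accession is not None:
            self._by_accession[accession].append(barcode)

    def find_by_barcode(self, barcode: str) -> dict:
        return self._by_barcode[barcode]

    def find_by_accession(self, accession: str) -> List[dict]:
        # Always a list, because a historic accession number may match several sheets.
        return [self._by_barcode[b] for b in self._by_accession.get(accession, [])]


index = HerbariumIndex()
index.add("DAO000000001", accession="12345")  # made-up values
index.add("DAO000000002", accession="12345")  # same historic accession number
matches = index.find_by_accession("12345")
```

Returning a list from the accession lookup keeps the "not guaranteed to be unique" caveat visible to whoever calls it.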
For CNC:
Specimen ID: Similar to a catalog number or DAO's barcode; is a unique number, and the primary identifier. It is how we find specimens in the collection (this number is physically on the specimen in almost all cases). This number needs to be prominent in the record.
Collection Event ID: A number that is related to a Collection Event (the where, when, who and how an organism was collected). This number is used to auto-populate the data of specimen records by entering this number into the record and saving it (thereby auto-populating relevant data fields). Many Specimen IDs can use one Collecting Event ID.
Dissection ID: A number that would be put on a dissected part of a specimen that is kept separate from the main specimen. It needs to be unique within this field of numbers. The main user of this is our Lepidoptera collection. Christi Jaeger can be consulted for more specific details. It is always related to a Specimen ID and is somewhat like a subsample or a piece of the organism.
Other ID: A catch all for other numbers that are unique or semi-unique. Examples of things currently found in this field:
Molecular Sequence Numbers: Scientist-specific and not stored in CNCDB, although it seems we are heading toward storing them in DINA. We have BOLD and some GenBank numbers in the database, but when scientists sequence material in house, they do not keep track of it in our database. Looking to the future, it would be good to ask them how they would use this function if it were available, so that we could store sequence info for our specimens.
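The Collection Event ID behaviour described for CNC (enter the number, save, and the related fields fill in) could look roughly like the sketch below. All class names, field names, and example values are hypothetical; this only illustrates the one-event-to-many-specimens relationship, not how CNCDB or DINA actually implement it.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CollectingEvent:
    event_id: str
    locality: str      # where
    collected_on: str  # when
    collectors: str    # who
    method: str        # how


@dataclass
class SpecimenRecord:
    specimen_id: str
    collecting_event_id: Optional[str] = None
    locality: str = ""
    collected_on: str = ""
    collectors: str = ""
    method: str = ""


def save(record: SpecimenRecord, events: Dict[str, CollectingEvent]) -> SpecimenRecord:
    """On save, copy the event data into the specimen record."""
    if record.collecting_event_id is not None:
        event = events[record.collecting_event_id]
        record.locality = event.locality
        record.collected_on = event.collected_on
        record.collectors = event.collectors
        record.method = event.method
    return record


events = {"CE-001": CollectingEvent("CE-001", "Ottawa, ON", "2019-06-14", "A. Collector", "Malaise trap")}
first = save(SpecimenRecord("CNC0000001", collecting_event_id="CE-001"), events)
second = save(SpecimenRecord("CNC0000002", collecting_event_id="CE-001"), events)
```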
For CCAMF:
Database ID: The unique identifier for the record in the database; it is auto-generated. We never use it or look up our specimens by that number, but it's there.
Collection Event ID: A number that is related to a Collection Event (the where, when, who and how an organism was collected). This number is used to auto-populate the data of specimen records by entering this number into the record and saving it (thereby auto-populating relevant data fields). Many Specimen IDs can use one Collecting Event ID.
Specimen ID: The primary identifier. It is how we find specimens in the collection (this number is physically on the specimen). This number needs to be prominent in the record. In our case it is unique in the sense that no two different strains would have the same number, BUT we add versions to that number, e.g. 1000A was made in 2005, 1000B was made in 2007 from 1000A, etc.
Collection ID: Not an identifier, but I think it's worth mentioning that our collection is split in two, INVIVO and INVITRO, and that prefix goes in front of our Specimen IDs on our labels. We can filter to see only one or the other, and Specimen IDs can be duplicated across the two: I have both INVIVO1000A and INVITRO1000A.
DAOM number: If the specimen is pure and a slide has been deposited to DAOM, then we have that barcode number associated at the specimen level. Not all specimens have been deposited.
Other ID: Usually used for specimens received from other collections, recording their ID for that strain.
Slide number: the number of the microscopic slide that was made with spores from that specimen.
Molecular Sequence Numbers: How this is handled depends on how the molecular module works, but we do keep track of these, currently in a notes field; we would need a specific field for them.
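To make the CCAMF labelling concrete: the label is the collection prefix plus the Specimen ID, and the Specimen ID is a strain number plus a version letter. The sketch below splits a label into those parts. The regular expression, the `CcamfLabel` type, and the assumption that versions are zero or more capital letters are mine, not a CCAMF specification.

```python
import re
from typing import NamedTuple


class CcamfLabel(NamedTuple):
    collection: str      # "INVIVO" or "INVITRO"
    strain_number: int   # e.g. 1000
    version: str         # e.g. "A", "B"; empty if no version (assumed possible)


_LABEL = re.compile(r"^(INVIVO|INVITRO)\s*(\d+)([A-Z]*)$")


def parse_label(label: str) -> CcamfLabel:
    match = _LABEL.match(label.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised CCAMF label: {label!r}")
    collection, number, version = match.groups()
    return CcamfLabel(collection, int(number), version)


def specimen_id(label: CcamfLabel) -> str:
    """The Specimen ID as used within one collection, without the prefix."""
    return f"{label.strain_number}{label.version}"


in_vivo = parse_label("INVIVO1000A")
in_vitro = parse_label("INVITRO1000A")
```

Here `specimen_id(in_vivo)` and `specimen_id(in_vitro)` are both `"1000A"`, while the full tuples differ, which is why the collection prefix has to be part of any uniqueness check.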
Official CCFC Identifiers, i.e. those with a function in the database
Every human-readable identifier listed below has a unique identifier associated with it from the SeqDB database. In most cases these are invisible, but they can be discovered through the web addresses for specific records, and some summary tables display them.
These unique identifiers are what the 2D barcodes from SeqDB encode, so it is very important that they be maintained in the new system. We have spent the last 5 years labelling every item in the collection with these barcodes to allow remote activities on the items using barcode readers. We use the barcode readers to initiate processes on these items, and this interfaces with the database to update the status of the items.
Specimen Identifiers
The historical application of identifiers in CCFC has been less than rigorous. This means we have a number of identifier formats that serve as catalogue numbers; by this I mean the names that have been published. They include the following:
“Collection Code” “Collection number”: DAOMC 123456
“Collection Code”: Until 3 years ago this was DAOM; we changed it to DAOMC to avoid confusion with the herbarium collection going forward. Many identifiers are shared between the two collections and represent the same organisms. THIS IS SOMETHING I HOPE WE CAN EXPLOIT IN THE NEW SYSTEM! I would love for the records to be linked at least.
“Collection number”: Can range from 2 to 6 digits.
“Collection Code” “Collection number” “Sub ID”: Collections inherited from some researchers have kept their original collector numbers as the “Collection number”. These are further separated out by a second part, or addendum, to the “Collection Code”. Because the “Specimen Identifier” fields in SeqDB are purely numerical, these were added in that data set as a Sub ID.
Published Identifier | Identifier in Database |
---|---|
DAOMC BR 144 or DAOM BR 144 or BR 144 | DAOMC 144 BR |
DAOMC F1353 or DAOM F1353 or F1353 | DAOMC 1353 F |
DAOMC G1961 or DAOM G1961 or G1961 | DAOMC 1961 G |
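The mapping in the table above can be expressed as a small normalisation routine: accept any of the published spellings and return the database form. This is a sketch under assumptions: the regular expression, the `SpecimenIdentifier` type, and the function names are hypothetical, and it only covers the patterns shown in the table plus the plain `DAOMC 123456` form.

```python
import re
from typing import NamedTuple, Optional


class SpecimenIdentifier(NamedTuple):
    collection_code: str
    collection_number: int
    sub_id: Optional[str]


# Optional DAOM/DAOMC prefix, optional letter prefix (stored as the Sub ID), then 2 to 6 digits.
_PUBLISHED = re.compile(r"^(?:(DAOMC?)\s*)?(?:([A-Z]+)\s*)?(\d{2,6})$")


def parse_published(text: str) -> SpecimenIdentifier:
    match = _PUBLISHED.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised identifier: {text!r}")
    _old_code, sub_id, number = match.groups()
    # DAOM was renamed to DAOMC, so every spelling maps to the current code.
    return SpecimenIdentifier("DAOMC", int(number), sub_id)


def database_form(identifier: SpecimenIdentifier) -> str:
    parts = [identifier.collection_code, str(identifier.collection_number)]
    if identifier.sub_id:
        parts.append(identifier.sub_id)
    return " ".join(parts)


normalised = [database_form(parse_published(s)) for s in ("DAOMC BR 144", "DAOM F1353", "G1961")]
```

Keeping the parsed parts in a tuple, rather than only the joined string, would also make it easier to link a DAOMC record to a DAOM herbarium record that shares the same number.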
The rest of our identifiers are derived from this specimen identifier and include the following:
Specimen Replicate Name: concatenation of “Collection Code” “Collection number” “Sub ID” with a version added = DAOMC 123456 A, DAOMC1351FA or DAOMC1961GA
Sample Name: concatenation of “Collection Code” “Collection number” “Sub ID” with a version added = DAOMC 123456 A, DAOMC1351FA or DAOMC1961GA
Yes, the human-readable format of these two identifiers is identical, but they do have unique identifiers in the back end. And in the physical world the labelled items would hardly ever be confused in our workflows: it is pretty easy to see the difference between a fungal culture and a DNA extract.
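A sketch of what "same readable name, different back-end identifier" means in practice follows. The `LabelledItem` class, the use of a UUID as the back-end identifier, and the compact no-spaces name format are all assumptions for illustration; the thread does not say what form SeqDB's internal identifiers take.

```python
import uuid
from dataclasses import dataclass, field


def derived_name(collection_code: str, collection_number: int, sub_id: str = "", version: str = "A") -> str:
    """Concatenate code, number, Sub ID and version (compact form, assumed)."""
    return f"{collection_code}{collection_number}{sub_id}{version}"


@dataclass(frozen=True)
class LabelledItem:
    kind: str  # "specimen_replicate" or "sample"
    name: str  # human-readable, not necessarily unique across kinds
    backend_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # what the 2D barcode would encode


name = derived_name("DAOMC", 1961, "G", "A")
culture = LabelledItem("specimen_replicate", name)
dna_extract = LabelledItem("sample", name)
```

A barcode scan would resolve `backend_id`, so the shared readable name never has to be disambiguated by the scanner.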
Sequence Name: There is no enforced formatting on these names, but we follow these suggestions:
Raw sequences: Sample Name_Experimenter_Batch_GeneRegion_Primer
Consensus sequences: Sample Name_Genus_species_GeneTarget
PCR Batches and Sequence Batches: free-form names
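The two suggested sequence name patterns are underscore-joined fields, so they are easy to build (and later split) programmatically. The helper names below are hypothetical, the example values are made up, and rejecting underscores inside a field is my own assumption to keep names splittable; the convention itself is not enforced in SeqDB.

```python
def _join(*parts: str) -> str:
    for part in parts:
        if not part:
            raise ValueError("name parts must not be empty")
        if "_" in part:
            raise ValueError(f"{part!r} contains '_', which is used as the separator")
    return "_".join(parts)


def raw_sequence_name(sample_name: str, experimenter: str, batch: str, gene_region: str, primer: str) -> str:
    # Sample Name_Experimenter_Batch_GeneRegion_Primer
    return _join(sample_name, experimenter, batch, gene_region, primer)


def consensus_sequence_name(sample_name: str, genus: str, species: str, gene_target: str) -> str:
    # Sample Name_Genus_species_GeneTarget
    return _join(sample_name, genus, species, gene_target)


raw = raw_sequence_name("DAOMC1961GA", "JS", "B12", "ITS", "ITS1F")
consensus = consensus_sequence_name("DAOMC1961GA", "Penicillium", "chrysogenum", "ITS")
```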
Flow and connectivity of human-readable names: see at bottom of message
We also have other names which are associated with unique identifiers and also use our barcode labelling system. These include:
Storage Units: free-form naming system
Storage Containers: free-form naming system
Other Identifiers, currently tracked as text fields:
“Other IDs”: all other known identifiers associated with the organisms
“DAOM Number”: the herbarium catalogue number
“Isolate Number”: in most cases equal to the Collection or Collector Number as discussed in collecting events
“CCFC Number”: hangover numbers, never used publicly, from a previous database attempt
“Institution Code” and “Permanent Collection Code”: for reference to other recognised culture collections
For the CPVC it is as follows:
Primary identifiers:
CPVC # = catalog # = accession # … text & numbers & dashes (or underscores) .. this will gain ‘extensions’ over time as rejuvenations are done ..
Unknown # = material sample without taxonomic name .. different than CPVC # .. needs to be able to transition to CPVC # (once identified with taxonomic name)
Barcode # = This is new to the CPVC, but something we have been planning for a couple of years. We are thinking a combination of barcodes and QR codes depending on the item in question. For example, trees in the field will have QR codes that can be scanned and will bring up all relevant data on what is in that particular host and the organism(s) of interest inside of it. Items such as RNA samples and clones will have barcodes to help track them.
Secondary identifiers:
Acquisition # = number automatically given to a new material sample (must be associated with date) .. to enable record keeping for yearly reports .. separate from unknown &/or CPVC # but also closely tied with these numbers once defined
Rejuvenation # = number automatically given to an initiated rejuvenation ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports
HE # = ‘herbaceous experiment’ # = number attached to each virus transfer done (must be associated closely with date and CPVC #); an HE # can be associated with acquisitions, rejuvenations, verifications and/or distributions
Verification # = number automatically given to an initiated verification ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports
Distribution # = number automatically given to an initiated distribution ‘event’ (must be associated closely with date) .. for yearly record keeping
Seq ID - A unique number that will be given to each sample to be NGS sequenced when the process is initiated. This will have to be closely associated with many factors that need to be tracked. It includes at least one sub-name from each of the lab prepping the sequence and the company doing the actual sequencing run.
Clone ID - A unique number given to each new clone that will be stored for future use. These are usually PCR clones that have been sent for Sanger sequencing and will be used as positive controls for future PCR reactions. These need to be periodically rejuvenated. (must be associated closely with date and CPVC #)
RNA Extract # - A unique number given to each new RNA extraction that will be stored for future use. (must be associated closely with date and CPVC #)
Primer ID - Currently just given a number followed by a name (starting at 425 to 997) we may need to adjust this in the future.
FD Batch # - number automatically given to an initiated freeze drying ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports.
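One requirement in the CPVC list is that an Unknown # must be able to transition to a CPVC # once the material is identified. A minimal sketch of that transition follows; the `MaterialSample` class, its field names, and the number formats (`UNK-0042`, `CPVC-0123`) are hypothetical. It keeps the old Unknown # on the record so earlier labels and reports can still be resolved.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MaterialSample:
    acquisition_number: int               # secondary identifier, tied to a date
    acquired_on: date
    unknown_number: Optional[str] = None  # used until a taxonomic name is assigned
    cpvc_number: Optional[str] = None     # catalog # = accession #

    @property
    def primary_identifier(self) -> str:
        identifier = self.cpvc_number or self.unknown_number
        if identifier is None:
            raise ValueError("sample has neither a CPVC # nor an Unknown #")
        return identifier

    def identify(self, cpvc_number: str) -> None:
        """Promote the sample to a CPVC # without discarding the Unknown #."""
        if self.cpvc_number is not None:
            raise ValueError("sample already has a CPVC #")
        self.cpvc_number = cpvc_number


sample = MaterialSample(acquisition_number=1, acquired_on=date(2021, 5, 3), unknown_number="UNK-0042")
before = sample.primary_identifier
sample.identify("CPVC-0123")
after = sample.primary_identifier
```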
@rintoult @shannonasencio @michellelocke @ron-reade @banchinic
@Collections: Please list and describe the identifiers you need to be included in DINA. If relevant, describe their relationships to other identifiers.
Please also say whether or not you have a "primary identifier" that you would want to be prioritized over any other for function.