identify your identifiers

heathercole commented 3 years ago

@rintoult @shannonasencio @michellelocke @ron-reade @banchinic

@Collections; please list and describe the identifiers you need to be included in DINA. If relevant, describe their relationships to other identifiers.

Please also indicate whether or not you have a "primary identifier" that you would want prioritized over all others for functional purposes.

shannonasencio commented 3 years ago

For DAO, the primary identifier is the barcode number (this truly internally unique number is to be considered the catalogue number). We must also record historic DAO accession numbers (these numbers are not guaranteed to be internally unique, but for most of the herbarium's history served as the catalogue number).

For DAOM, the primary identifier is a truly internally unique incrementing catalogue number assigned by collections staff. The barcode (also truly internally unique) is a secondary identifier. I want to investigate making the barcode the primary identifier in the future for new acquisitions.
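The DAO distinction above (a barcode that is truly unique versus historic accession numbers that may repeat) could be sketched roughly as follows. This is only an illustration: `HerbariumIndex` and its methods are hypothetical names, not part of DINA, and the identifier values used with it would be made up.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class HerbariumIndex:
    """Hypothetical in-memory index: unique barcodes, repeatable accession numbers."""

    def __init__(self) -> None:
        self._accession_by_barcode: Dict[str, Optional[str]] = {}
        self._barcodes_by_accession: Dict[str, List[str]] = defaultdict(list)

    def add(self, barcode: str, accession: Optional[str] = None) -> None:
        # The barcode acts as the catalogue number, so duplicates are rejected.
        if barcode in self._accession_by_barcode:
            raise ValueError(f"barcode already in use: {barcode}")
        self._accession_by_barcode[barcode] = accession
        # Historic accession numbers are not guaranteed unique, so keep a list.
        if accession is not None:
            self._barcodes_by_accession[accession].append(barcode)

    def find_by_accession(self, accession: str) -> List[str]:
        """Return every barcode recorded under a historic accession number."""
        return list(self._barcodes_by_accession.get(accession, []))
```

Looking up a historic accession number can therefore return several specimens, while a barcode lookup is always unambiguous.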

michellelocke commented 3 years ago

For CNC:

Specimen ID: Similar to a catalog number or DAO's barcode; is a unique number, and the primary identifier. It is how we find specimens in the collection (this number is physically on the specimen in almost all cases). This number needs to be prominent in the record.
Collection Event ID: A number that is related to a Collection Event (the where, when, who and how an organism was collected). This number is used to auto-populate the data of specimen records by entering this number into the record and saving it (thereby auto-populating relevant data fields). Many Specimen IDs can use one Collecting Event ID.
Dissection ID: A number that would be put on a dissected part of a specimen that is kept separate from the main specimen. Needs to be unique amongst this field of numbers. The main users of this are our Lepidoptera collections. Christi Jaeger can be consulted for more specific details. This is always related to a Specimen ID and is kind of like a subsample or a piece of the organism.
Other ID: A catch all for other numbers that are unique or semi-unique. Examples of things currently found in this field:
1. A unique identifier from another institution; that is a second Specimen ID.
2. A CNC Type number (unique to a type series, which may be one or many specimens, a type of accession number for type series).
3. A historical collecting event number *Note: sometimes it is difficult to know if a historical number is a unique specimen ID number or a collecting event number. Without context, knowledge of the original collector's notes or other specimens with similar (or the same) numbers it can be impossible to know the meaning of some numbers.
4. BOLD Sample ID: Unique number specifically from the BOLD database. The Sample ID is equivalent to a Specimen ID; often the two are the same, but it may be another Specimen ID added to the specimen at the Biodiversity Institute of Ontario (if different, it is generally because the BOLD Sample ID was added first, then a second CNC# when it came back to CNC).
5. BOLD Process ID: Unique number specifically from BOLD database. Process ID is the number given to the DNA sequence that is linked to the specimen with the Sample ID.
6. BOLD BIN: Barcode Index Number, non-unique number that refers to a cluster of DNA sequences.
7. JSM Number: Jeff Skevington Molecular number, a unique number used by Jeff Skevington as a way to organize material in the molecular freezer. Could point to an entire specimen or to a part removed for DNA. I believe this is more of a temporary organizational number to keep material in the freezer in order. I do not know if any other researchers use a similar system for material they keep for sequencing. Maybe it could be included with storage info (but we should clarify with him)?
8. GenBank Number: Unique ID of the DNA sequence submitted to GenBank, could be more than one associated to a Specimen ID.
9. Notebook codes/Project codes: May refer to a collecting event, but generally a code put on a specimen with a corresponding number in a notebook/excel file that contains information about the collecting event or specimen. We do not always have the original data from the project/collector. Often entered into Notes field but sometimes in Other ID.
10. Projects: Probably should have its own home and not stay here. Names of projects have been put in this field for lack of a better place to put them.
11. Other numbers: Sometimes there are other numbers that we do not know the meaning of. They are sometimes entered in Other ID and sometimes Notes.
Molecular Sequence Numbers: Scientist specific and not stored in CNCDB. However it seems like we are trying to head toward storing them in DINA. We have BOLD and some GenBank numbers in the database but as scientists sequence material in house, they do not keep track of this in our database. Looking to the future, this would be good to talk to them about how they would use this function if available to them so that we could store sequence info for our specimens.
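As a rough sketch of the CNC relationships described above, one could model a specimen record with a primary Specimen ID, an optional shared Collecting Event ID, and a typed list of "Other IDs" so that the catch-all field keeps track of what each number means. All class, field, and function names here are hypothetical and only illustrate the shape of the data; they are not the CNCDB or DINA schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class OtherId:
    kind: str   # e.g. "BOLD Process ID", "GenBank Number", or "unknown"
    value: str


@dataclass
class SpecimenRecord:
    specimen_id: str                            # primary identifier
    collecting_event_id: Optional[str] = None   # shared by many specimens
    dissection_ids: List[str] = field(default_factory=list)
    other_ids: List[OtherId] = field(default_factory=list)

    def ids_of_kind(self, kind: str) -> List[str]:
        """Return all Other ID values of one kind (there may be several)."""
        return [o.value for o in self.other_ids if o.kind == kind]


def specimens_for_event(records: List[SpecimenRecord], event_id: str) -> List[str]:
    """List the Specimen IDs that use a given Collecting Event ID."""
    return [r.specimen_id for r in records if r.collecting_event_id == event_id]
```

Keeping the kind next to each value makes room for several GenBank numbers on one specimen and for numbers whose meaning is not yet known.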

banchinic commented 3 years ago

For CCAMF:

Database ID: That is the unique identifier for the record in the database, it is auto-generated. We never ever use it or look up our specimens using that number but it's there.

Collection Event ID: A number that is related to a Collection Event (the where, when, who and how an organism was collected). This number is used to auto-populate the data of specimen records by entering this number into the record and saving it (thereby auto-populating relevant data fields). Many Specimen IDs can use one Collecting Event ID.

Specimen ID: the primary identifier. It is how we find specimens in the collection (this number is physically on the specimen). This number needs to be prominent in the record. In our case it is unique in the sense that no two different strains would have the same number, BUT we add versions to that number, e.g. 1000A was made in 2005, 1000B was made in 2007 from 1000A, etc.

Collection ID: not an identifier, but I think it's worth mentioning that our collection is separated in 2: INVIVO and INVITRO, and that prefix is in front of our Specimen IDs on our labels. We can filter to only see one or the other, and Specimen IDs can be duplicated across the two: I have INVIVO1000A and INVITRO1000A.
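To illustrate the label structure just described (collection prefix, strain number, version suffix), here is a small parsing sketch. It assumes the version is written as uppercase letters directly after the number, which matches the examples above but may not cover every label; `CcamfLabel` and `parse_label` are hypothetical names.

```python
import re
from typing import NamedTuple


class CcamfLabel(NamedTuple):
    collection: str   # "INVIVO" or "INVITRO"
    number: int       # strain number shared by all versions
    version: str      # e.g. "A", "B"; empty if the label has no version


# Assumption: prefix, digits, then optional uppercase version letters.
_LABEL = re.compile(r"^(INVIVO|INVITRO)\s*(\d+)([A-Z]*)$")


def parse_label(label: str) -> CcamfLabel:
    """Split a label such as 'INVIVO1000A' into its three parts."""
    match = _LABEL.match(label.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised label: {label!r}")
    return CcamfLabel(match.group(1), int(match.group(2)), match.group(3))
```

With this split, INVIVO1000A and INVITRO1000A share a number and version but remain distinct because the collection differs.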

DAOM number: IF that specimen is pure and a slide has been deposited to DAOM then we have that barcode number associated at the specimen level, not all specimens have been deposited.

Other ID: Usually used for specimens received from other collections, recording their ID for that strain.

Slide number: the number of the microscopic slide that was made with spores from that specimen.

Molecular Sequence Numbers: This depends on how the molecular module works, but we do keep track of these, currently in a notes field; we would need a specific field for them.

rintoult commented 3 years ago

Official CCFC Identifiers, i.e. those with a function in the database

Every human readable identifier listed below has a unique identifier associated with it from the SeqDB database. In most cases these are invisible, but they can be discovered through the web addresses of specific records, and some summary tables display them.

These unique identifiers are what the 2D barcodes from SeqDB encode. This means that it is very important for these to be maintained in the new system. The reason is that we have spent the last 5 years labelling every item in the collection with these barcodes to allow for remote activities on the items using barcode readers. We use the barcode readers to initiate processes on these items and this interfaces with the database to update status on the items.

Specimen Identifiers

The historical application of identifiers in CCFC has been less than rigorous. This means we have a number of formats of identifiers which are catalogue numbers. In this case I mean the names applied that have been published. They include the following:

“Collection Code” “Collection number”: DAOMC 123456

“Collection Code” was DAOM until 3 years ago; we changed it to DAOMC to avoid confusion with the herbarium collection going forward. Many identifiers are shared between the two collections and represent the same organisms. THIS IS SOMETHING I HOPE WE CAN EXPLOIT IN THE NEW SYSTEM! I would love for the records to be linked at least.

“Collection number”: Can range from 2 to 6 digits.

“Collection Code” “Collection number” “Sub ID”: Inherited collections from some researchers have kept their original collector numbers as “Collection number”. These are further separated out by a second code or addendum to the “Collection Code”. Due to the limitations of the “Specimen Identifier” fields in SeqDB, which are just numerical, these were added in that data set as a SubID.

Published Identifier	Identifier in Database
DAOMC BR 144 or DAOM BR 144 or BR 144	DAOMC 144 BR
DAOMC F1353 or DAOM F1353 or F1353	DAOMC 1353 F
DAOMC G1961 or DAOM G1961 or G1961	DAOMC 1961 G
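The mapping in the table above could be sketched as a small normalisation function. This is an assumption-laden illustration rather than how SeqDB does it: the regular expression only covers the published shapes listed in the table plus the plain “Collection Code” “Collection number” form, and `to_database_identifier` is a hypothetical name.

```python
import re

# Optional DAOM/DAOMC prefix, optional letter Sub ID, then 2 to 6 digits.
_PUBLISHED = re.compile(r"^(?:DAOMC?\s*)?(?:([A-Z]+)\s*)?(\d{2,6})$")


def to_database_identifier(published: str) -> str:
    """Rewrite a published identifier into the 'DAOMC number SubID' form."""
    match = _PUBLISHED.match(published.strip())
    if match is None:
        raise ValueError(f"unrecognised identifier: {published!r}")
    sub_id, number = match.group(1), match.group(2)
    # The Sub ID moves behind the number; the code is always DAOMC.
    return f"DAOMC {number} {sub_id}" if sub_id else f"DAOMC {number}"
```

For example, `to_database_identifier("BR 144")` and `to_database_identifier("DAOM BR 144")` both give `"DAOMC 144 BR"`, so the three published variants in each row resolve to the same database identifier.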

The rest of our identifiers are derived from this specimen identifier and include the following:

Specimen Replicate Name: concatenation of “Collection Code” “Collection number” “Sub ID” with a version added = DAOMC 123456 A, DAOMC1351FA or DAOMC1961GA

Sample Name: concatenation of “Collection Code” “Collection number” “Sub ID” with a version added = DAOMC 123456 A, DAOMC1351FA or DAOMC1961GA

Yes the human readable format of these two identifiers is identical, but they do have unique identifiers in the back end. And in the physical world the item labelled would hardly ever be confused when it comes to our workflows. Pretty easy to see the difference between a fungal culture and a DNA extract.
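A minimal sketch of that point: two items can carry the same human readable name while staying distinct through their back-end identifiers. The spacing rule in `derived_name` simply mirrors the two example shapes given above and is an assumption; `LabelledItem`, `backend_id`, and the integer values are hypothetical stand-ins for the SeqDB identifiers.

```python
from dataclasses import dataclass
from typing import List


def derived_name(collection_code: str, number: int, version: str, sub_id: str = "") -> str:
    """Concatenate code, number, Sub ID and version into a derived name.

    Assumption: names with a Sub ID are written without spaces and names
    without one are space-separated, as in the examples in the thread.
    """
    if sub_id:
        return f"{collection_code}{number}{sub_id}{version}"
    return f"{collection_code} {number} {version}"


@dataclass(frozen=True)
class LabelledItem:
    backend_id: int   # hypothetical stand-in for the value in the 2D barcode
    kind: str         # "specimen replicate" or "sample"
    name: str         # human readable name, may be shared


def find_by_backend_id(items: List[LabelledItem], backend_id: int) -> LabelledItem:
    """Resolve a scanned barcode value to exactly one item."""
    for item in items:
        if item.backend_id == backend_id:
            return item
    raise KeyError(backend_id)
```

A barcode scan resolves through `backend_id`, so a culture and a DNA extract with identical names are never mixed up by the system.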

Sequence Name: There is no enforced formatting on these names but we follow these suggestions

Raw sequences: Sample Name_Experimenter_Batch_GeneRegion_Primer

Consensus sequences: Sample Name_Genus_species_GeneTarget

PCR Batches and Sequence Batches: free form names
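The two suggested sequence-name patterns can be sketched as simple builders, with a parser for the raw form. This assumes that only the Sample Name may contain underscores or spaces and that the other fields do not; the function names and any example values (experimenter initials, batch, primer) are illustrative only.

```python
from typing import Dict


def raw_sequence_name(sample_name: str, experimenter: str, batch: str,
                      gene_region: str, primer: str) -> str:
    """Build 'Sample Name_Experimenter_Batch_GeneRegion_Primer'."""
    return "_".join([sample_name, experimenter, batch, gene_region, primer])


def consensus_sequence_name(sample_name: str, genus: str, species: str,
                            gene_target: str) -> str:
    """Build 'Sample Name_Genus_species_GeneTarget'."""
    return "_".join([sample_name, genus, species, gene_target])


def parse_raw_sequence_name(name: str) -> Dict[str, str]:
    """Split a raw sequence name from the right into its five fields."""
    parts = name.rsplit("_", 4)
    if len(parts) != 5:
        raise ValueError(f"expected five fields, got: {name!r}")
    keys = ["sample_name", "experimenter", "batch", "gene_region", "primer"]
    return dict(zip(keys, parts))
```

Because the formatting is not enforced, a parser like this can only be a best effort; names that do not follow the suggestion would need to be handled as free text.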

Flow and connectivity of human readable names: see the bottom of this message.

We also have other names which are associated with unique identifiers that also use our barcode labelling system. These include:

Storage Units: Free form naming system

Storage Containers: Free form naming system

Other Identifiers tracked as text fields currently

“Other IDs”: all other known identifiers associated with the organism; “DAOM Number”: the herbarium catalogue number; “Isolate Number”: in most cases this is equal to the Collection or Collector Number as discussed in collecting events; “CCFC Number”: these are hangover numbers, never used publicly, from a previous database attempt; “Institution Code” and “Permanent Collection Code”: these are for reference to other recognised culture collections.

[Figure: CCFC identifier linkages]

ron-reade commented 3 years ago

For the CPVC it is as follows: Primary identifiers:

CPVC # = catalog # = accession # … text & numbers & dashes (or underscores) .. this will gain ‘extensions’ over time as rejuvenations are done ..

Unknown # = material sample without taxonomic name .. different than CPVC # .. needs to be able to transition to CPVC # (once identified with taxonomic name)

Barcode # = This is new to the CPVC, but something we have been planning for a couple of years. We are thinking a combination of barcodes and QR codes depending on the item in question. For example, trees in the field will have QR codes that can be scanned and will bring up all relevant data on what is in that particular host and the organism(s) of interest inside of it. Items such as RNA samples and clones will have barcodes to help track them.
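The transition from an Unknown # to a CPVC # mentioned above might look something like the sketch below. It assumes the old Unknown # should be kept for traceability once the sample is identified, which the thread does not state; `MaterialSample` and its fields are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MaterialSample:
    unknown_number: Optional[str] = None   # used until a taxonomic name exists
    cpvc_number: Optional[str] = None      # catalog # = accession #
    former_identifiers: List[str] = field(default_factory=list)

    @property
    def primary_identifier(self) -> Optional[str]:
        """The CPVC # once assigned, otherwise the Unknown #."""
        return self.cpvc_number or self.unknown_number

    def assign_cpvc_number(self, cpvc_number: str) -> None:
        """Promote an identified sample from Unknown # to CPVC #."""
        if self.cpvc_number is not None:
            raise ValueError("sample already has a CPVC #")
        # Assumption: keep the Unknown # so older records can still be traced.
        if self.unknown_number is not None:
            self.former_identifiers.append(self.unknown_number)
        self.cpvc_number = cpvc_number
```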

Secondary identifiers:

Acquisition # = number automatically given to a new material sample (must be associated with date) .. to enable record keeping for yearly reports .. separate from unknown &/or CPVC # but also closely tied with these numbers once defined

Rejuvenation # = number automatically given to an initiated rejuvenation ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports

HE # = ‘herbaceous experiment’ # = number attached to each virus transfer done (must be associated closely with date and CPVC #); an HE # can be associated with acquisitions, rejuvenations, verifications and/or distributions

Verification # = number automatically given to an initiated verification ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports

Distribution # = number automatically given to an initiated distribution ‘event’ (must be associated closely with date) .. for yearly record keeping

Seq ID - A unique number that will be given to each sample to be NGS sequenced when the process is initiated. This will have to be closely associated with many factors that need to be tracked. Includes at least one sub name each, given by the lab prepping the sequence and by the company doing the actual sequence run.

Clone ID - A unique number given to each new clone that will be stored for future use. These are usually PCR clones that have been sent for Sanger sequencing and will be used as positive controls for future PCR reactions. These need to be periodically rejuvenated. (must be associated closely with date and CPVC #)

RNA Extract # - A unique number given to each new RNA extraction that will be stored for future use. (must be associated closely with date and CPVC #)

Primer ID - Currently just given a number followed by a name (numbers currently run from 425 to 997); we may need to adjust this in the future.

FD Batch # - number automatically given to an initiated freeze drying ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports.
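Several of the secondary identifiers above share one pattern: a number given automatically when an event is initiated, tied to a date (and usually a CPVC #), and counted for yearly reports. A rough sketch of that pattern, with hypothetical class and method names and a simple per-kind counter standing in for whatever numbering scheme is eventually chosen:

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Dict, Iterator, List, Optional


@dataclass(frozen=True)
class Event:
    kind: str                    # "rejuvenation", "verification", "distribution", ...
    number: int                  # assigned automatically, per kind
    on: date                     # every event must carry a date
    cpvc_number: Optional[str]   # distributions may not be tied to one CPVC #


class EventLog:
    def __init__(self) -> None:
        self._counters: Dict[str, Iterator[int]] = {}
        self.events: List[Event] = []

    def record(self, kind: str, on: date, cpvc_number: Optional[str] = None) -> Event:
        """Create an event and give it the next number for its kind."""
        counter = self._counters.setdefault(kind, count(1))
        event = Event(kind, next(counter), on, cpvc_number)
        self.events.append(event)
        return event

    def yearly_counts(self, year: int) -> Dict[str, int]:
        """Count events per kind in one year, e.g. for a yearly report."""
        return dict(Counter(e.kind for e in self.events if e.on.year == year))
```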

AAFC-BICoE / dina-planning

identify your identifiers #186