AAFC-BICoE / dina-planning

AAFC-DINA planning repository

identify your identifiers #186

Open heathercole opened 3 years ago

heathercole commented 3 years ago

@rintoult @shannonasencio @michellelocke @ron-reade @banchinic

@Collections: please list and describe the identifiers you need included in DINA. If relevant, describe their relationships to other identifiers.

Please also say whether or not you have a "primary identifier" that you would want prioritized over any other for functional purposes.

shannonasencio commented 3 years ago

For DAO, the primary identifier is the barcode number (this truly internally unique number is to be considered the catalogue number). We must also record historic DAO accession numbers (these numbers are not guaranteed to be internally unique, but for most of the herbarium's history served as the catalogue number).

For DAOM, the primary identifier is a truly internally unique incrementing catalogue number assigned by collections staff. The barcode (also truly internally unique) is a secondary identifier. I want to investigate making the barcode the primary identifier in the future for new acquisitions.
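As a rough illustration of the DAO requirement above (a unique barcode as primary identifier, alongside historic accession numbers that may repeat), here is a minimal Python sketch. The class names, field names and example values are hypothetical and are not taken from DINA or any existing database.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HerbariumSpecimen:
    # Hypothetical record shape; field names are illustrative only.
    barcode: str  # primary identifier, must be unique
    accession_numbers: list = field(default_factory=list)  # historic, may repeat


class SpecimenIndex:
    """Enforces unique barcodes but tolerates shared accession numbers."""

    def __init__(self) -> None:
        self._by_barcode = {}
        self._by_accession = defaultdict(list)

    def add(self, specimen: HerbariumSpecimen) -> None:
        if specimen.barcode in self._by_barcode:
            raise ValueError(f"duplicate barcode: {specimen.barcode}")
        self._by_barcode[specimen.barcode] = specimen
        for number in specimen.accession_numbers:
            self._by_accession[number].append(specimen)

    def get(self, barcode: str) -> HerbariumSpecimen:
        # A barcode lookup returns exactly one record.
        return self._by_barcode[barcode]

    def find_by_accession(self, number: str) -> list:
        # An accession lookup may return zero, one or several records.
        return list(self._by_accession.get(number, []))
```

The point of the sketch is only that a lookup by primary identifier returns one record, while a lookup by a secondary identifier has to be able to return several.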

michellelocke commented 3 years ago

For CNC:

banchinic commented 3 years ago

For CCAMF:

Database ID: The auto-generated unique identifier for the record in the database. We never use it or look up our specimens by that number, but it's there.

Collection Event ID: A number that is related to a Collection Event (the where, when, who and how an organism was collected). Entering this number into a specimen record and saving it auto-populates the relevant data fields. Many Specimen IDs can use one Collection Event ID.

Specimen ID: the primary identifier. It is how we find specimens in the collection (this number is physically on the specimen). This number needs to be prominent in the record. In our case it is unique in the sense that no two different strains would have the same number, BUT we add versions to that number. For example: 1000A was made in 2005, 1000B was made in 2007 from 1000A, etc.

Collection ID: not an identifier, but I think it's worth mentioning that our collection is separated in two, INVIVO and INVITRO, and that prefix is in front of our Specimen IDs on our labels. We can filter to see only one or the other, and Specimen IDs can be duplicated across the two: I have INVIVO1000A and INVITRO1000A.

DAOM number: IF that specimen is pure and a slide has been deposited to DAOM then we have that barcode number associated at the specimen level, not all specimens have been deposited.

Other ID: Usually used for specimens received from other collections, recording their ID for that strain.

Slide number: the number of the microscopic slide that was made with spores from that specimen.

Molecular Sequence Numbers: This depends on how the molecular module works, but we do keep track of these. They are currently in a notes field; we would need a specific field for them.
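To make the CCAMF label structure described above concrete (an optional INVIVO/INVITRO prefix, the Specimen ID number, and a version suffix), here is a small Python sketch of a label parser. The function and type names are hypothetical, and the sketch assumes the version is always one or more letters, which the thread does not confirm.

```python
import re
from typing import NamedTuple, Optional


class SpecimenLabel(NamedTuple):
    collection: Optional[str]  # "INVIVO", "INVITRO", or None if no prefix
    number: int                # the Specimen ID proper
    version: str               # assumed to be letters such as "A" or "B"


# Assumed label shape: optional prefix, digits, optional version letters.
_LABEL = re.compile(r"^(INVIVO|INVITRO)?\s*(\d+)([A-Z]*)$")


def parse_label(text: str) -> SpecimenLabel:
    match = _LABEL.match(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised label: {text!r}")
    collection, number, version = match.groups()
    return SpecimenLabel(collection, int(number), version)
```

With this shape, `parse_label("INVIVO1000A")` and `parse_label("INVITRO1000A")` share the number 1000 and the version A but differ in collection, which is why the prefix has to be part of any uniqueness check.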

rintoult commented 3 years ago

Official CCFC identifiers, i.e. those with a function in the database

Every human-readable identifier listed below has a unique identifier associated with it from the SeqDB database. In most cases these are invisible, but they can be discovered through the web addresses of specific records, and some summary tables display them.

These unique identifiers are what the 2D barcodes from SeqDB encode. This means that it is very important for these to be maintained in the new system. The reason is that we have spent the last 5 years labelling every item in the collection with these barcodes to allow for remote activities on the items using barcode readers. We use the barcode readers to initiate processes on these items and this interfaces with the database to update status on the items.

Specimen Identifiers

The historical application of identifiers in CCFC has been less than rigorous. This means we have several formats of identifier that serve as catalogue numbers; by that I mean the names that have been applied in publications. They include the following:

“Collection Code” “Collection number”: DAOMC 123456

Until 3 years ago the “Collection Code” was DAOM; we changed it to DAOMC to avoid confusion with the herbarium collection going forward. Many identifiers are shared between the two collections and represent the same organisms. THIS IS SOMETHING I HOPE WE CAN EXPLOIT IN THE NEW SYSTEM! I would love for the records to be linked at least.

“Collection number”: can range from 2 to 6 digits.

“Collection Code” “Collection number” “Sub ID”: Inherited collections from some researchers have kept their original collector numbers as “Collection number”. These are further separated out by a second code, or addendum, to the “Collection Code”. Because the “Specimen Identifier” fields in SeqDB are purely numerical, these addenda were added in that data set as a Sub ID.

| Published Identifier | Identifier in Database |
| --- | --- |
| DAOMC BR 144 or DAOM BR 144 or BR 144 | DAOMC 144 BR |
| DAOMC F1353 or DAOM F1353 or F1353 | DAOMC 1353 F |
| DAOMC G1961 or DAOM G1961 or G1961 | DAOMC 1961 G |
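The published and database forms listed above could be reconciled with a small normalising function. The following Python sketch is one way to do it; the function name and the regular expression are hypothetical, and it assumes every published form is an optional DAOM or DAOMC code, an optional run of letters as the Sub ID, and a 2 to 6 digit number.

```python
import re

# Assumed shape: optional "DAOM"/"DAOMC", optional letters (Sub ID), 2-6 digits.
_PUBLISHED = re.compile(r"^(?:DAOMC?\s*)?([A-Z]+)?\s*(\d{2,6})$")


def to_database_form(published: str) -> str:
    """Rewrite a published identifier as 'DAOMC <number> [<Sub ID>]'."""
    match = _PUBLISHED.match(published.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised identifier: {published!r}")
    sub_id, number = match.groups()
    parts = ["DAOMC", number]
    if sub_id:
        parts.append(sub_id)  # the Sub ID moves behind the number
    return " ".join(parts)
```

Because all three published variants of an identifier collapse to the same database form, a search in the new system could accept any of them. Note that the sketch treats a bare DAOM code as the culture collection, which would need care wherever herbarium numbers are entered in the same field.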

 

The rest of our identifiers are derived from this specimen identifier and include the following:

Specimen Replicate Name: concatenation of “Collection Code” “Collection number” “Sub ID” with a version added = DAOMC 123456 A, DAOMC1351FA or DAOMC1961GA

Sample Name: concatenation of “Collection Code” “Collection number” “Sub ID” with a version added = DAOMC 123456 A, DAOMC1351FA or DAOMC1961GA

Yes, the human-readable format of these two identifiers is identical, but they do have unique identifiers in the back end. And in the physical world the labelled items would hardly ever be confused in our workflows. It is pretty easy to see the difference between a fungal culture and a DNA extract.
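A short Python sketch of the situation just described: two items carry the same human-readable name but are distinct records in the back end. Everything here is illustrative. The names are hypothetical, the spacing of the derived name is an assumption (the examples above are written both with and without spaces), and a UUID merely stands in for whatever unique identifier SeqDB actually assigns.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class ItemKind(Enum):
    SPECIMEN_REPLICATE = "specimen replicate"  # e.g. a fungal culture
    SAMPLE = "sample"                          # e.g. a DNA extract


def derived_name(code: str, number: str, version: str, sub_id: str = "") -> str:
    # Concatenate the non-empty parts; the space separator is an assumption.
    return " ".join(part for part in (code, number, sub_id, version) if part)


@dataclass(frozen=True)
class LabelledItem:
    kind: ItemKind
    name: str
    # Stand-in for the back-end identifier that the 2D barcode encodes.
    uid: uuid.UUID = field(default_factory=uuid.uuid4)


name = derived_name("DAOMC", "123456", "A")
culture = LabelledItem(ItemKind.SPECIMEN_REPLICATE, name)
extract = LabelledItem(ItemKind.SAMPLE, name)
```

Here `culture.name == extract.name`, yet the two records differ in both `kind` and `uid`, so a barcode scan resolves to exactly one of them.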

Sequence Name: There is no enforced formatting on these names, but we follow these suggestions:

Raw sequences: Sample Name_Experimenter_Batch_GeneRegion_Primer

Consensus sequences: Sample Name_Genus_species_GeneTarget

PCR Batches and Sequence Batches:  free form names
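The suggested naming patterns for raw and consensus sequences can be expressed as two small helper functions. This Python sketch is hypothetical. In particular, rejecting underscores inside a part is an added assumption, made so that a name can be split back into its parts; it is not a rule stated above.

```python
def _join_parts(*parts: str) -> str:
    for part in parts:
        # Assumption: parts are non-empty and contain no underscore.
        if not part or "_" in part:
            raise ValueError(f"invalid name part: {part!r}")
    return "_".join(parts)


def raw_sequence_name(sample_name: str, experimenter: str, batch: str,
                      gene_region: str, primer: str) -> str:
    # Sample Name_Experimenter_Batch_GeneRegion_Primer
    return _join_parts(sample_name, experimenter, batch, gene_region, primer)


def consensus_sequence_name(sample_name: str, genus: str, species: str,
                            gene_target: str) -> str:
    # Sample Name_Genus_species_GeneTarget
    return _join_parts(sample_name, genus, species, gene_target)
```

For example, with made-up values, `raw_sequence_name("DAOMC 123456 A", "RA", "B12", "ITS", "ITS1F")` gives `DAOMC 123456 A_RA_B12_ITS_ITS1F`.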

 

Flow and connectivity of human-readable names: see the diagram at the bottom of this message.

We also have other names which are associated with unique identifiers that also use our barcode labelling system. These include:

Storage Units: Free form naming system

Storage Containers: Free form naming system

Other Identifiers tracked as text fields currently

“Other IDs”: all other known identifiers associated with the organism.

“DAOM Number”: the herbarium catalogue number.

“Isolate Number”: in most cases equal to the Collection or Collector Number as discussed in collecting events.

“CCFC Number”: hangover numbers from a previous database attempt, never used publicly.

“Institution Code” and “Permanent Collection Code”: for reference to other recognised culture collections.

 

[Image: CCFC identifier linkages]

ron-reade commented 3 years ago

For the CPVC it is as follows:

Primary identifiers:

CPVC # = catalog # = accession # … text & numbers & dashes (or underscores) .. this will gain ‘extensions’ over time as rejuvenations are done ..

Unknown # = material sample without taxonomic name .. different than CPVC # .. needs to be able to transition to CPVC # (once identified with taxonomic name)

Barcode # = This is new to the CPVC, but something we have been planning for a couple of years. We are thinking a combination of barcodes and QR codes depending on the item in question. For example, trees in the field will have QR codes that can be scanned and will bring up all relevant data on what is in that particular host and the organism(s) of interest inside of it. Items such as RNA samples and clones will have barcodes to help track them.
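The transition from an Unknown # to a CPVC # described above could be modelled roughly as follows. This Python sketch is hypothetical throughout: the class, its fields and the example values are invented for illustration, and keeping the old Unknown # as a former identifier is an assumption rather than a stated requirement.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MaterialSample:
    unknown_number: Optional[str] = None
    cpvc_number: Optional[str] = None
    taxon: Optional[str] = None
    former_identifiers: list = field(default_factory=list)

    @property
    def primary_identifier(self) -> str:
        # The CPVC # takes over as soon as one has been assigned.
        identifier = self.cpvc_number or self.unknown_number
        if identifier is None:
            raise ValueError("sample has no identifier yet")
        return identifier

    def identify(self, taxon: str, cpvc_number: str) -> None:
        """Record a taxonomic name and move from Unknown # to CPVC #."""
        if self.cpvc_number is not None:
            raise ValueError("sample already has a CPVC #")
        if self.unknown_number is not None:
            # Assumption: the Unknown # stays searchable after the change.
            self.former_identifiers.append(self.unknown_number)
        self.taxon = taxon
        self.cpvc_number = cpvc_number
```

A sample created as `MaterialSample(unknown_number="UNK-0042")` reports `UNK-0042` as its primary identifier until `identify(...)` is called, after which it reports the CPVC # while the old number remains on record.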

Secondary identifiers:

Acquisition # = number automatically given to a new material sample (must be associated with date) .. to enable record keeping for yearly reports .. separate from unknown &/or CPVC # but also closely tied with these numbers once defined

Rejuvenation # = number automatically given to an initiated rejuvenation ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports

HE # = ‘herbaceous experiment’ # = number attached to each virus transfer done (must be associated closely with date and CPVC #); an HE # can be associated with acquisitions, rejuvenations, verifications and/or distributions

Verification # = number automatically given to an initiated verification ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports

Distribution # = number automatically given to an initiated distribution ‘event’ (must be associated closely with date) .. for yearly record keeping

Seq ID - A unique number that will be given to each sample to be NGS sequenced when the process is initiated. This will have to be closely associated with many factors that need to be tracked. It includes at least one sub-name each from the lab prepping the sequence and from the company doing the actual sequence run.

Clone ID - A unique number given to each new clone that will be stored for future use. These are usually PCR clones that have been sent for Sanger sequencing and will be used as positive controls for future PCR reactions. These need to be periodically rejuvenated. (must be associated closely with date and CPVC #)

RNA Extract # - A unique number given to each new RNA extraction that will be stored for future use. (must be associated closely with date and CPVC #)

Primer ID - Currently just a number followed by a name (the numbers currently run from 425 to 997); we may need to adjust this in the future.

FD Batch # - number automatically given to an initiated freeze drying ‘event’ (must be associated closely with date and CPVC #) .. to enable record keeping for yearly reports.
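Several of the secondary identifiers above share one pattern: a number given automatically when an event is initiated, tied to a date and usually to a CPVC #, and counted for yearly reports. A minimal Python sketch of that pattern follows. The class names are hypothetical, and numbering each event type separately, starting at 1, is an assumption.

```python
import datetime as dt
from collections import Counter
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                   # e.g. "Rejuvenation", "Verification"
    number: int                 # assigned automatically per event kind
    date: dt.date
    cpvc_number: Optional[str]  # None where no CPVC # applies yet


class EventLog:
    def __init__(self) -> None:
        self._counters = {}  # one running counter per event kind
        self.events = []

    def record(self, kind: str, date: dt.date,
               cpvc_number: Optional[str] = None) -> Event:
        counter = self._counters.setdefault(kind, count(1))
        event = Event(kind, next(counter), date, cpvc_number)
        self.events.append(event)
        return event

    def yearly_totals(self, year: int) -> Counter:
        # Number of events of each kind in the given year, for reporting.
        return Counter(e.kind for e in self.events if e.date.year == year)
```

Recording two rejuvenations and one distribution in the same year would give numbers 1 and 2 for the rejuvenations, 1 for the distribution, and yearly totals of 2 and 1 respectively.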