CCB-SB / plsdb

PLSDB pipeline to collect bacterial plasmids from NCBI
https://ccb-microbe.cs.uni-saarland.de/plsdb/

PLSDB (v. 2020_11_19) BioSample Attributes #10

Closed haruosuz closed 2 years ago

haruosuz commented 3 years ago

PLSDB (version 2020_11_19) was inspected.

For Location_BIOSAMPLE, of the 27939 records, 9057 were NA, 1112 were "missing", and 259 were "not applicable". Perhaps uninformative values (e.g. "missing" and "not applicable") should be set to NA?

I wonder if collection_date could be included in PLSDB? https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/

Would it be better to filter out unusual sequences, such as the plasmid with GC_NUCCORE = 0.0106? https://www.ncbi.nlm.nih.gov/nuccore/NZ_CP026741.1

VGalata commented 3 years ago

Dear @haruosuz,

I agree with the first two suggestions: the values in column Location_BIOSAMPLE matching "missing" and "not applicable" could be set to NA to indicate missing information, and we could include the Biosample collection date if it is available.
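A minimal sketch of that cleanup, assuming the metadata table has been loaded as a list of dicts (for example with `csv.DictReader`). Only the column name `Location_BIOSAMPLE` and the two placeholder strings come from this thread; the helper names, the `id` key, the case-insensitive matching, and the sample rows are illustrative assumptions, not part of the PLSDB pipeline.

```python
# Placeholder strings treated as missing. "missing" and "not applicable"
# are the values reported in this thread; matching them case-insensitively
# is an assumption.
PLACEHOLDERS = {"missing", "not applicable"}


def normalize_placeholder(value, na="NA"):
    """Return `na` for None, empty, or placeholder values; else the stripped value."""
    if value is None:
        return na
    stripped = value.strip()
    if not stripped or stripped.lower() in PLACEHOLDERS:
        return na
    return stripped


def normalize_column(records, column, na="NA"):
    """Return copies of the records with one column normalized."""
    return [{**rec, column: normalize_placeholder(rec.get(column), na)}
            for rec in records]


# Hypothetical rows for illustration only.
records = [
    {"id": "example_a", "Location_BIOSAMPLE": "Japan"},
    {"id": "example_b", "Location_BIOSAMPLE": "missing"},
    {"id": "example_c", "Location_BIOSAMPLE": "Not Applicable"},
    {"id": "example_d", "Location_BIOSAMPLE": ""},
]
cleaned = normalize_column(records, "Location_BIOSAMPLE")
```

The same helper could be applied to other BioSample-derived columns if they turn out to contain the same placeholder strings.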

Regarding your last question about whether it is better to filter out unusual plasmid sequences based on their GC content: depending on what you want to do with the data, I would say that removing records with extreme and unusual values makes sense. In addition to GC content, you can consider using sequence length (see also issue #8).
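Such a filter could look roughly like the sketch below. The column name `GC_NUCCORE` is taken from this thread; `Length_NUCCORE`, the `id` key, the sample rows, and both threshold ranges are assumptions chosen for illustration and would need to be tuned to the analysis at hand.

```python
def is_plausible(record, gc_range=(0.2, 0.8), length_range=(1_000, 1_000_000)):
    """Return True if GC content and length both fall inside the given ranges.

    The default ranges are arbitrary illustrative values, not recommendations.
    """
    gc = float(record["GC_NUCCORE"])
    length = int(record["Length_NUCCORE"])  # assumed column name
    gc_ok = gc_range[0] <= gc <= gc_range[1]
    length_ok = length_range[0] <= length <= length_range[1]
    return gc_ok and length_ok


# Hypothetical rows; only the GC value 0.0106 echoes the case raised above.
records = [
    {"id": "example_a", "GC_NUCCORE": "0.5012", "Length_NUCCORE": "52000"},
    {"id": "example_b", "GC_NUCCORE": "0.0106", "Length_NUCCORE": "4800"},
    {"id": "example_c", "GC_NUCCORE": "0.4500", "Length_NUCCORE": "120"},
]
kept = [r for r in records if is_plausible(r)]
dropped = [r["id"] for r in records if not is_plausible(r)]
```

Keeping the list of dropped identifiers makes it easy to inspect which records were excluded before committing to a threshold.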

Let me know if you have further questions.

@SmalJonni, @Xethic Could we please include the first two suggestions in the next update?

  1. Replacing "missing" and "not applicable" values in the Biosample data table should be straightforward. I think these values can also appear in Biosample attributes.
  2. Including the Biosample collection date should not be very difficult. Let me know if you need help with the EDirect query commands. As @haruosuz said, the format of this field is described in the BioSample attributes documentation linked above.

Xethic commented 3 years ago

We scheduled it for the next major update. Thanks!

SmalJonni commented 2 years ago