Open CamelCaseCam opened 9 months ago
Could you give a specific example based off of the current schema?
yup - if we look at the pichia file in my fork, it reads
pAOX1:
...
pGAP:
...
and so on in the promoter section. I would like to have a file called plasmids at collections/pichia and instead put includes: ["pichia/plasmids"] to keep things more organized.
Oh, so lemme get this straight: in collections.yaml, rather than just having something like file: collections/pichia.yaml, you have includes: ["pichia/plasmids/promoter.yaml"], right?
Not exactly. You'd still have file: collections/pichia.yaml, but in pichia.yaml I could put includes: ["pichia/plasmids/promoter.yaml"]
Why not just have that in collections.yaml then? Putting includes in the collection file (pichia.yaml) would make the depth of files 1 layer larger, while breaking what could be seen as the structure of the file (i.e., pichia.yaml is a list of parts).
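To make the trade-off concrete, here is a rough sketch of what resolving includes from inside a collection file might look like. Nothing here is the project's actual loader: the resolve function, the FILES stand-in for parsed YAML, and the rule that include paths are relative to collections/ are all assumptions made for illustration. Only the file names and the pAOX1 / pGAP part names come from the thread.

```python
# Hypothetical in-memory stand-in for parsed YAML files, keyed by path.
# The layout (include paths relative to "collections/") is an assumption.
FILES = {
    "collections/pichia.yaml": {"includes": ["pichia/plasmids/promoter.yaml"]},
    "collections/pichia/plasmids/promoter.yaml": {"pAOX1": {}, "pGAP": {}},
}


def resolve(path, load, root="collections", _stack=()):
    """Return (parts, depth) for the file at `path`, following `includes`.

    `load` is any callable mapping a path to an already-parsed mapping.
    `depth` counts how many files deep the longest include chain goes.
    """
    if path in _stack:
        chain = " -> ".join(_stack + (path,))
        raise ValueError("include cycle: " + chain)

    doc = dict(load(path))  # copy so the caller's data is not mutated
    includes = doc.pop("includes", [])

    parts = {}
    depth = 1
    for rel in includes:
        child_parts, child_depth = resolve(
            root + "/" + rel, load, root, _stack + (path,)
        )
        clash = parts.keys() & child_parts.keys()
        if clash:
            raise ValueError("duplicate parts: " + ", ".join(sorted(clash)))
        parts.update(child_parts)
        depth = max(depth, 1 + child_depth)

    # Parts defined directly in this file sit alongside the included ones.
    clash = parts.keys() & doc.keys()
    if clash:
        raise ValueError("duplicate parts: " + ", ".join(sorted(clash)))
    parts.update(doc)
    return parts, depth


parts, depth = resolve("collections/pichia.yaml", FILES.__getitem__)
```

With this layout, parts holds pAOX1 and pGAP and depth is 2, which is the extra layer being objected to above; keeping includes in collections.yaml only would leave every collection file a flat list of parts.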
To take a step back on the overall design, I think the one thing I'm really going for is keeping it as simple / maintainable as possible, while still having enough information for automated systems to work on the sequences. Your pichia link is actually pretty much the perfect amount of information (except references should be a list). It has the following characteristics:
Rather than something like iGEM (the pRham promoter, for example), I'd like to go for something more along the lines of what FutureHouse's WikiCrow is doing. The actual documentation for each part is contained deep within the scientific literature, which we now have the ability to crawl with LLMs.
On the other side, I'd like to embrace teaching-by-example, so hand-crafting part combinations with descriptions of what they do in order to serve as training data for LLMs and humans to put together parts. But that is separate.
"Otherwise, these files could get pretty big."
They don't get too big, usually <200kb per toolkit. The limiting factor is pretty much guaranteed to be "oh shit I can't reasonably synthesize this".
Fair enough - that makes sense!
Have you considered adding an include statement field to divide content from toolkits between multiple organisms? Otherwise, these files could get pretty big.
I'm thinking something simple like
includes: ["x", "y", "z"]