BjornFJohansson / pydna

Clone with Python! Data structures for double stranded DNA & simulation of homologous recombination, Gibson assembly, cut & paste cloning.

Cache sequence files as they are written and read #10

Closed BjornFJohansson closed 9 years ago

BjornFJohansson commented 9 years ago

This feature is meant to ease the IPython notebook-based workflow for pydna.

I think that the concept of having a directory describing a project or sub-project is a sound one. It enables version control and draws a natural boundary around the project. A directory is also easier to share, and an IPython notebook file would have access to all files in it. Unfortunately, this also implies that all sequences on which the project depends have to be present in the directory, probably as GenBank files.

This means a lot of manual searching for and copying of files if a new project depends on the results of many older projects or files.

One solution could be to cache files written by pydna in some central location, such as the data_dir that is already used to cache slow functions. The simplest approach would be to save each sequence file as

name_seguid.ext

where name is the original name, seguid is the seguid checksum of the file and ext is the extension, which depends on the sequence type (like .gb for GenBank files).

Adding the checksum to the name would allow different sequences to share the same name. Although this is poor practice, both files would be preserved.
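A minimal sketch of the naming scheme, to make the idea concrete. The names `checksum`, `cache_filename` and `cache_write` are hypothetical and not part of pydna. The checksum is a URL-safe base64 SHA-1 digest used only as a stand-in for the seguid, chosen so that the result is always safe to use in a file name; the actual seguid function in pydna may differ.

```python
import base64
import hashlib
from pathlib import Path


def checksum(sequence: str) -> str:
    """Stand-in for the seguid (assumption): URL-safe base64 of the
    SHA-1 digest of the upper-cased sequence, with padding stripped."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def cache_filename(name: str, sequence: str, ext: str = ".gb") -> str:
    """Build the cache file name following the name_seguid.ext pattern."""
    return f"{name}_{checksum(sequence)}{ext}"


def cache_write(cache_dir: Path, name: str, sequence: str, text: str,
                ext: str = ".gb") -> Path:
    """Write the file content `text` to the central cache directory.

    Two different sequences with the same name get different file
    names, so neither overwrites the other.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / cache_filename(name, sequence, ext)
    path.write_text(text)
    return path
```

With this scheme, writing the same sequence twice under the same name ends up in the same cache file, while a different sequence under that name produces a second file next to it.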

A new "get" or "fetch" function would copy a file from a central repository of sequence files to the current working directory. The files would be referred to by their

name.ext

and the seguid would be stripped from the name before copying. If there are two sequences with the same name, both should be copied, retaining the seguid for both so that the reason for the collision could be inspected manually.
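The "fetch" behaviour described above could look roughly like this. It is only a sketch: the function name `fetch`, its parameters and the fixed `CHECKSUM_LEN` are assumptions, not existing pydna API. A fixed checksum length is assumed so that the checksum can be split off reliably even if the name or the checksum itself contains underscores.

```python
import shutil
from pathlib import Path
from typing import List

# Assumed fixed checksum length (a SHA-1 digest in base64 without padding).
CHECKSUM_LEN = 27


def fetch(name_ext: str, cache_dir: Path, dest_dir: Path) -> List[Path]:
    """Copy the cached file(s) called `name_ext` (e.g. "pUC19.gb")
    from `cache_dir` into `dest_dir` and return the new paths."""
    wanted = Path(name_ext)
    tail = CHECKSUM_LEN + 1  # checksum plus the separating underscore

    matches = []
    for path in sorted(cache_dir.iterdir()):
        stem = path.stem
        if (path.suffix == wanted.suffix
                and len(stem) > tail
                and stem[-tail] == "_"
                and stem[:-tail] == wanted.stem):
            matches.append(path)

    if not matches:
        raise FileNotFoundError(f"{name_ext} not found in {cache_dir}")

    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in matches:
        if len(matches) == 1:
            # Unique name: strip the checksum before copying.
            target = dest_dir / wanted.name
        else:
            # Name collision: keep the checksum so both files can be inspected.
            target = dest_dir / path.name
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

Calling `fetch("pUC19.gb", cache_dir, Path.cwd())` in a notebook would then place `pUC19.gb` in the working directory, or several `pUC19_<checksum>.gb` files if the name is ambiguous.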

Sequences could also be read from strings, but it makes little sense to cache these.