This feature is meant to ease the IPython notebook based workflow for pydna.
I think that the concept of having a directory describing a project or sub-project is a sound one. It enables version control and makes a natural boundary around the project. It is also easier to share, and an IPython notebook file would have access to all files in the directory. Unfortunately, this also implies that all sequences upon which the project depends have to be present in the directory, probably as GenBank files.
This means a lot of manual searching for and copying of files if a new project depends on the results of many older projects or files.
One solution could be to cache files written by pydna in a central location, such as the data_dir that is already used to cache slow functions. The simplest approach would be to save each sequence file as
name_seguid.ext
where name is the original name, seguid is the SEGUID checksum of the file, and ext is an extension that depends on the sequence type (such as .gb for GenBank files).
Adding the checksum to the name would allow different sequences to share the same name. Although this is poor practice, both files would be preserved.
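The naming scheme above could be sketched roughly as follows. This is only an illustration, not pydna code: the helper names `seguid` and `cache_filename` are hypothetical, and the checksum is assumed to be computed on the sequence string as a SEGUID (base64-encoded SHA-1 of the uppercased sequence, with padding removed). Replacing "+" and "/" so that the checksum is safe in a file name is a further assumption.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Base64-encoded SHA-1 of the uppercased sequence, without '=' padding."""
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def cache_filename(name: str, sequence: str, ext: str = ".gb") -> str:
    """Build a name_seguid.ext file name (hypothetical helper)."""
    # Assumption: "+" and "/" are replaced by "-" and "_" because "/"
    # cannot appear in a file name.
    safe = seguid(sequence).replace("+", "-").replace("/", "_")
    return f"{name}_{safe}{ext}"


# Two different sequences with the same name get different file names.
print(cache_filename("pUC19", "ACGT"))
print(cache_filename("pUC19", "ACGTT"))
```

Because a SHA-1 digest is 20 bytes, the unpadded checksum is always 27 characters long, which makes it easy to strip from the file name again later.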
A new "get" or "fetch" function would copy a file from the central repository of sequence files to the current working directory. Files would be referred to by their
name.ext
and the seguid would be stripped from the name before copying. If two sequences have the same name, both should be copied with the seguid retained, so that the reason for the collision can be inspected manually.
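A minimal sketch of such a function is shown below, to make the collision rule concrete. The function name `fetch`, its parameters, and the assumption that cached files are named name_seguid.ext with a 27-character checksum are all illustrative and not an existing pydna API.

```python
import shutil
from pathlib import Path

SEGUID_LEN = 27  # an unpadded base64 SHA-1 digest is always 27 characters


def fetch(name_ext, cache_dir, dest_dir="."):
    """Copy cached files matching name.ext into dest_dir (hypothetical sketch).

    Returns the list of paths that were written.
    """
    wanted = Path(name_ext)
    prefix = wanted.stem + "_"
    # Find every cached file called <name>_<27-character checksum><ext>.
    matches = [
        p for p in sorted(Path(cache_dir).iterdir())
        if p.suffix == wanted.suffix
        and p.stem.startswith(prefix)
        and len(p.stem) == len(prefix) + SEGUID_LEN
    ]
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in matches:
        # A unique match loses its checksum; colliding names keep theirs
        # so that the collision can be inspected manually.
        target = dest / (wanted.name if len(matches) == 1 else src.name)
        shutil.copy2(src, target)
        copied.append(target)
    return copied
```

A call such as `fetch("pUC19.gb", cache_dir)` would then leave either a single pUC19.gb in the working directory, or several pUC19_seguid.gb files if the name is ambiguous. Here `cache_dir` stands for wherever the central repository ends up living.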
Sequences could also be read from strings, but it makes little sense to cache these.