biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences
Apache License 2.0

Generalize the seqrepo interface and implement new backends #136

Open reece opened 4 months ago

reece commented 4 months ago
Difficulty: Medium
Expected Duration: 175h
Possible Mentors: @reece

Summary

SeqRepo provides a simple interface to biological sequences and subsequences, with a single backend that provides fast random-access to local, non-redundant, compressed, and journaled sequences. The original use case for SeqRepo was to provide fast and reliable access to sequences in a clinical genetics reporting pipeline. (See design)

The goal of this issue is to create an abstract interface that supports other storage backends, as well as caching and federation layers as depicted here:

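[Image: diagram of the proposed abstract interface with storage backends, caching, and federation layers]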

See #61 for additional information.
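As a rough illustration of the layering described above, here is a minimal Python sketch. Everything in it is hypothetical: the class names (`SequenceBackend`, `InMemoryBackend`, `CachingBackend`, `FederatedBackend`), the `fetch` signature, and the sequence identifiers are assumptions made for illustration and are not the existing SeqRepo API. An in-memory dictionary stands in for real stores such as local files, Redis, S3, or a REST service.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class SequenceBackend(ABC):
    """Hypothetical abstract interface; not the actual SeqRepo API."""

    @abstractmethod
    def fetch(self, seq_id: str, start: Optional[int] = None,
              end: Optional[int] = None) -> str:
        """Return the sequence for seq_id, or its 0-based half-open slice
        [start, end). Raise KeyError if this backend does not know seq_id."""


class InMemoryBackend(SequenceBackend):
    """Toy backend standing in for a real store (files, Redis, S3, REST)."""

    def __init__(self, sequences: Dict[str, str]) -> None:
        self._sequences = dict(sequences)
        self.calls = 0  # counts lookups so the effect of caching is visible

    def fetch(self, seq_id, start=None, end=None):
        self.calls += 1
        return self._sequences[seq_id][start:end]


class CachingBackend(SequenceBackend):
    """Caching layer: stores full sequences fetched from a slower backend."""

    def __init__(self, source: SequenceBackend) -> None:
        self._source = source
        self._cache: Dict[str, str] = {}

    def fetch(self, seq_id, start=None, end=None):
        if seq_id not in self._cache:
            # Fetch the whole sequence once; later slices are served locally.
            self._cache[seq_id] = self._source.fetch(seq_id)
        return self._cache[seq_id][start:end]


class FederatedBackend(SequenceBackend):
    """Federation layer: tries each backend in order, returns the first hit."""

    def __init__(self, backends: List[SequenceBackend]) -> None:
        self._backends = list(backends)

    def fetch(self, seq_id, start=None, end=None):
        for backend in self._backends:
            try:
                return backend.fetch(seq_id, start, end)
            except KeyError:
                continue  # not in this backend; try the next one
        raise KeyError(seq_id)


# Made-up identifiers and sequences, for illustration only.
local = InMemoryBackend({"seq1": "ACGTACGTAC"})
remote = InMemoryBackend({"seq2": "TTTTGGGGCC"})
repo = FederatedBackend([local, CachingBackend(remote)])

print(repo.fetch("seq1", 2, 6))  # GTAC
print(repo.fetch("seq2", 0, 4))  # TTTT
```

Because every layer exposes the same `fetch` method, callers do not need to know whether a sequence came from a local store, a cache, or a remote source. A real design would also need to cover aliases, writes, and cache eviction, which this sketch leaves out.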

Community Benefits

When implemented, this project will enable the following (and ideally implement a few of them):

Expected Results / Deliverables

Required and Desired Skills

Benefits to Intern

The intern will gain software architecture and interface abstraction skills while solving a contemporary, practical issue in modern bioinformatics.

How to apply

Students applying to this project should briefly describe their vision for this project, highlight their existing skills and the skills they would need to learn, and estimate an implementation timeline.

manulpatel commented 3 months ago

Hello @reece! I am Manul, from India, working as a backend engineer building RESTful APIs in TypeScript and NestJS, with PostgreSQL as the database. In my current project, I am trying to implement Redis for session management in my organisation. I have also contributed to Python-based open source projects.

I am interested in implementing these various storage backends for SeqRepo and being a part of the biocommons community. I couldn't find much info here, so could you please hint at what further steps or tasks, other than proposal prep, I need to follow to become a contributor to the biocommons org? Also, is there any other communication channel I need to be part of, as I can't enter the official Slack without the domain email?

Harsh-2004 commented 3 months ago

Dear @reece,

I hope this message finds you well. I am Harsha Aditya, a third-year undergraduate student at IIT Kanpur, majoring in Bioengineering. I am excited to apply for the SeqRepo project internship opportunity and contribute to its development.

Vision for the Project: My vision for SeqRepo is to extend its capabilities by implementing an abstract interface that supports various storage backends, caching mechanisms, and federation layers. I aim to create a flexible and scalable solution that seamlessly integrates with different data sources while ensuring fast and reliable access to biological sequences. Leveraging my expertise in C++ and Python, along with my knowledge of sequence alignment algorithms, I intend to enhance SeqRepo's functionality to meet the evolving needs of bioinformatics research and clinical genetics reporting.

Existing Skills: As a Quant developer and researcher at Devine Group and WorldQuant, I have gained significant experience in Python programming and utilizing common libraries. My background in quantitative finance has honed my skills in data analysis, algorithm development, and software engineering. Additionally, my knowledge of sequence alignment software and algorithms will be instrumental in understanding the domain-specific requirements of SeqRepo and designing efficient solutions.

Skills to Learn: While I am proficient in Python, I recognize the importance of expanding my skills to include backend-specific technologies such as Redis and AWS S3 for this project. I am committed to dedicating time to self-study and practical application to acquire the necessary skills. Furthermore, I am eager to deepen my understanding of caching techniques and explore how they can be applied to optimize SeqRepo's performance.

Implementation Timeline: Based on my initial assessment, I estimate that defining and implementing the abstract interface will take approximately 50 hours. Adapting the Fastadir to use the interface and incorporating the REST interface could require around 70 hours. Implementing a local sequence cache may take 40 hours, while integrating Redis, S3, or other backends could vary depending on their complexity, requiring around 55-60 hours each.

Conclusion: I am enthusiastic about the opportunity to contribute to SeqRepo and leverage my skills to address contemporary challenges in bioinformatics. I am confident that my background in C++, Python, and bioengineering, combined with my research experience, makes me well suited for this project. I am eager to collaborate with you and the team to achieve our objectives and advance SeqRepo's capabilities.

Thank you for considering my application. I look forward to the possibility of working together on this exciting project. Please direct me to the next steps.

Warm regards, Harsha Aditya

jsstevenson commented 4 weeks ago

Also linking #61 to this

manulpatel commented 4 weeks ago

Hi @jsstevenson! Is there any plan to implement new backends in the project anytime soon? I would like to work on this outside GSoC. I would be happy to learn the new tech here if you could suggest some starting points.