VerisimilitudeX / DNAnalyzer

Revolutionizing DNA analysis and making it accessible to all through innovative ML-powered analysis and interpretive tools.

Research and Database Creation: Organism Data with GC-Content #378

Closed VerisimilitudeX closed 1 year ago

VerisimilitudeX commented 1 year ago

This GitHub issue focuses on conducting research and creating an SQL database that combines organism data with corresponding GC-content information. The primary objective is to gather and compile a comprehensive dataset of organisms, along with their respective GC-content values, to facilitate further analysis and research.

Key tasks for this project include:

  1. Research and Data Collection: Conduct extensive research to gather organism data from reliable sources. This may involve exploring public databases, scientific literature, or existing datasets. Focus on acquiring diverse organisms representing various species across different taxonomic groups.

  2. GC-Content Calculation: Develop a robust methodology to calculate the GC-content for each organism's DNA or RNA sequences. Consider the nuances of GC-content calculation, such as accounting for sequence length, handling repetitive regions, and appropriately dealing with ambiguous nucleotides (e.g., N).

  3. Database Schema Design: Design an appropriate SQL database schema to store the organism data and corresponding GC-content information. Determine the necessary fields, data types, and relationships to efficiently represent the data. Consider incorporating relevant metadata, such as organism taxonomy, common name, and additional features if available.

  4. Database Creation: Implement the designed schema and create the SQL database. Choose a suitable database management system (e.g., MySQL, PostgreSQL) and ensure proper setup and configuration. Populate the database with the acquired organism data, including the calculated GC-content values.

  5. Data Quality Assurance: Perform data quality checks and validation to ensure the accuracy and integrity of the stored information. Verify the correctness of the GC-content calculations and cross-reference the data with trusted sources for validation.

  6. Database Documentation: Document the created SQL database, including the schema structure, table definitions, and relationships. Provide clear instructions on how to access and query the database. Consider generating sample queries to showcase potential use cases and demonstrate the utility of the database.

  7. Data Sharing and Collaboration: Share the SQL database and associated documentation on a public GitHub repository or an appropriate platform. Encourage collaboration and contributions from the community, such as suggesting improvements, adding additional data, or proposing new features for the database.
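The GC-content calculation in task 2 could be sketched roughly as follows. This is a minimal illustration, not the method DNAnalyzer itself uses. It assumes that ambiguous codes such as N are dropped from both numerator and denominator, that soft-masked (lowercase) repeat regions are still counted, and that counts are pooled across all records so that longer sequences carry more weight. The names `read_fasta`, `count_gc`, and `genome_gc_content` are hypothetical, and the example sequences are made up.

```python
from typing import Iterable, Iterator, Tuple

UNAMBIGUOUS_GC = "GC"
UNAMBIGUOUS_AT = "ATU"  # U is included so RNA sequences work too


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_gc(sequence: str) -> Tuple[int, int]:
    """Return (gc_bases, counted_bases), skipping N and other ambiguity codes."""
    seq = sequence.upper()  # assumption: soft-masked repeats are still counted
    gc = sum(seq.count(base) for base in UNAMBIGUOUS_GC)
    at = sum(seq.count(base) for base in UNAMBIGUOUS_AT)
    return gc, gc + at


def genome_gc_content(sequences: Iterable[str]) -> float:
    """Pool counts over all sequences and return the GC fraction (0 to 1)."""
    gc_total = counted_total = 0
    for sequence in sequences:
        gc, counted = count_gc(sequence)
        gc_total += gc
        counted_total += counted
    if counted_total == 0:
        raise ValueError("no unambiguous bases found")
    return gc_total / counted_total


# Made-up two-record example, for illustration only.
EXAMPLE_FASTA = """>contig1 made-up example
GGCCNN
>contig2 made-up example
atat
ATGC
"""

records = list(read_fasta(EXAMPLE_FASTA.splitlines()))
print(f"{genome_gc_content(seq for _, seq in records):.3f}")
```

Pooling the counts is a design choice: averaging the per-record percentages instead would let a short contig influence the result as much as a whole chromosome.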

Contributors should primarily focus on conducting thorough research to acquire reliable organism data and ensure accurate GC-content calculations. Emphasize creating a well-documented and easily accessible SQL database that can serve as a valuable resource for researchers and further analysis in the field of genomics and computational biology.
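As a starting point for tasks 3, 4, and 6, here is one possible schema with a sample query. It is only a sketch: the table names (`organism`, `gc_measurement`), the column names, and the helper functions are all hypothetical. SQLite is used here purely because it ships with Python; on MySQL or PostgreSQL the column types and the upsert syntax would differ slightly.

```python
import sqlite3

# Hypothetical schema: one row per organism, one row per measured file.
SCHEMA = """
CREATE TABLE IF NOT EXISTS organism (
    organism_id     INTEGER PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE,
    common_name     TEXT,
    taxonomic_group TEXT
);
CREATE TABLE IF NOT EXISTS gc_measurement (
    measurement_id INTEGER PRIMARY KEY,
    organism_id    INTEGER NOT NULL REFERENCES organism(organism_id),
    source_file    TEXT NOT NULL,
    gc_percent     REAL NOT NULL CHECK (gc_percent BETWEEN 0 AND 100)
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_measurement(conn, scientific_name, gc_percent, source_file,
                    common_name=None, taxonomic_group=None):
    """Insert the organism if it is new, then record one GC measurement."""
    conn.execute(
        "INSERT OR IGNORE INTO organism "
        "(scientific_name, common_name, taxonomic_group) VALUES (?, ?, ?)",
        (scientific_name, common_name, taxonomic_group),
    )
    (organism_id,) = conn.execute(
        "SELECT organism_id FROM organism WHERE scientific_name = ?",
        (scientific_name,),
    ).fetchone()
    conn.execute(
        "INSERT INTO gc_measurement (organism_id, source_file, gc_percent) "
        "VALUES (?, ?, ?)",
        (organism_id, source_file, gc_percent),
    )
    conn.commit()


def mean_gc_by_group(conn):
    """Sample query: average GC percentage per taxonomic group."""
    return conn.execute(
        "SELECT o.taxonomic_group, AVG(m.gc_percent) "
        "FROM gc_measurement AS m JOIN organism AS o USING (organism_id) "
        "GROUP BY o.taxonomic_group ORDER BY o.taxonomic_group"
    ).fetchall()
```

Splitting organisms from measurements lets the same organism have several assemblies or source files without repeating its taxonomy, and the `CHECK` constraint rejects out-of-range values at insert time as a first data-quality guard.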

LimesKey commented 1 year ago

I'm currently gathering FASTA genome files for different species, running them through the program to read the GC content, and logging the results in a spreadsheet.

VerisimilitudeX commented 1 year ago

Sounds good. Let me know if you encounter any bugs/errors in the program.

LimesKey commented 1 year ago

@VerisimilitudeX To automate and speed up this process, I wrote a file automation script and a GC-content script in my two favourite languages, PowerShell and Rust. Rust is so fast that the only bottleneck is my 100 Mb Ethernet connection to my NAS. The code below is very rough and unoptimized, but it works for me. It might not work for you, and you will likely have to edit a few things if you want to try it.

DNAnalyzer-File_Processor.zip

This zip file contains three files: the PowerShell automation script, the Rust GC-content script, and the compiled Rust GC-content binary. Right now, the only limitation is having to manually download tens of genomes from the NCBI FTP server (I could write a PowerShell script to automate that too).

VerisimilitudeX commented 1 year ago

Nice! I tried it and it looks great; good job. We just need to figure out how we can integrate Rust and Java together.

VerisimilitudeX commented 1 year ago

In the meantime, can you make a quick PR with this? I don't want to lose track of the code.

LimesKey commented 1 year ago

> In the meantime, can you make a quick PR with this? I don't want to lose track of the code.

Well, it's only temporary, and probably no one else will use it. I could open a branch and put it there, though.

VerisimilitudeX commented 1 year ago

Sure, good idea. We can have a branch for Rust code specifically.

LimesKey commented 1 year ago

I was looking more at this article by NCBI and at another program that does what the article says is possible. The article says the approach could work, and I agree, but their example program, GCSpeciesSorter, used more than 10,000 sample files to train an SVM (support vector machine) in Python.

I don't have any experience with SVMs, but it looks like a great way to increase performance. There is already a Rust crate that lets me use an SVM, so I'll check that out. If I can get enough NCBI genome test files, I have around 1 TB of space to store them and train the SVM.

LimesKey commented 1 year ago

I finished creating and automating everything, and I'm running it locally on my machine right now. You don't need any crazy hardware to run it: a 50 GB SSD and a decent internet connection should be enough. Try it out: https://github.com/VerisimilitudeX/DNAnalyzer/tree/database-gc-content/utils

Here is a zip file containing the first few outputs of the program, output.zip.

LimesKey commented 1 year ago

I have accumulated 1,448 plant files with their GC content using the program. Now all we need is another script to process them and insert the results into the SQL database. The files are zipped below. To get the GC content of a plant file, scroll to the bottom of the text file and find the line that begins with "Average GC-Content:".
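A minimal sketch of what such a processing script might look like, under several assumptions: the output files are `.txt` files in a single directory, the summary line reads "Average GC-Content:" followed by a plain number (optionally with a trailing %), and the value is stored exactly as written. The table name `plant_gc` and the functions `parse_average_gc` and `load_directory` are hypothetical, and SQLite stands in for whichever database is eventually chosen.

```python
import re
import sqlite3
from pathlib import Path
from typing import Optional

# Assumed line format: "Average GC-Content: 41.5" or "Average GC-Content: 41.5%"
GC_LINE = re.compile(r"Average GC-Content:\s*([0-9]+(?:\.[0-9]+)?)")


def parse_average_gc(text: str) -> Optional[float]:
    """Return the value on the last 'Average GC-Content:' line, if any."""
    matches = GC_LINE.findall(text)
    return float(matches[-1]) if matches else None


def load_directory(conn: sqlite3.Connection, directory: str) -> int:
    """Parse every .txt file in `directory` and store one row per file.

    Returns the number of files that contained a usable summary line.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS plant_gc ("
        "source_file TEXT PRIMARY KEY, gc_content REAL NOT NULL)"
    )
    loaded = 0
    for path in sorted(Path(directory).glob("*.txt")):
        value = parse_average_gc(path.read_text(errors="replace"))
        if value is None:
            continue  # no summary line found; skip rather than guess
        conn.execute(
            "INSERT OR REPLACE INTO plant_gc (source_file, gc_content) "
            "VALUES (?, ?)",
            (path.name, value),
        )
        loaded += 1
    conn.commit()
    return loaded
```

Taking the last match rather than the first follows the note that the summary sits at the bottom of each file, and files without a summary line are skipped so that incomplete runs do not end up in the database.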

GC-Content-Plant_1488.zip @VerisimilitudeX