v0.4.0
Breaking
Changed the input from a path to a FASTQ file to a path to a directory: The output of Guppy is now stored in multiple FASTQ files under the barcodeXX/ directory. Previously, it was necessary to combine the FASTQ files in the barcodeXX/ directory into one and specify it as an argument. With this revision, it is now possible to directly specify the barcodeXX/ directory, allowing users to seamlessly proceed to DAJIN2 analysis after Guppy processing. Commit Detail
Documentation
Changed "conda config --set channel_priority strict" to "conda config --set channel_priority flexible" in the installation instructions in TROUBLESHOOTING.md. Commit Detail
New Features
Added Apple Silicon (ARM64) support. Commit Detail
Changed the definition of a minor allele from 10 or fewer reads to 5 or fewer reads. This assumes that one sample contains 1000 reads, of which 0.5% corresponds to 5 reads. Commit Detail
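As an illustration of the Breaking change above, reading a barcodeXX/ directory directly amounts to iterating over every FASTQ file inside it instead of concatenating the files first. The following is a minimal sketch under that assumption, not DAJIN2's actual implementation; the function names iter_fastq_lines and count_reads and the accepted file extensions are hypothetical.

```python
import gzip
from pathlib import Path
from typing import Iterator

# Hypothetical set of extensions; the real tool may accept a different set.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def iter_fastq_lines(directory: str) -> Iterator[str]:
    """Yield every line of every FASTQ file in `directory`, in sorted file order."""
    files = sorted(
        path for path in Path(directory).iterdir()
        if path.name.endswith(FASTQ_SUFFIXES)
    )
    for path in files:
        # Gzip-compressed and plain files are opened with the matching opener.
        opener = gzip.open if path.name.endswith(".gz") else open
        with opener(path, "rt") as handle:
            for line in handle:
                yield line.rstrip("\n")


def count_reads(directory: str) -> int:
    """Count reads, assuming the usual four lines per FASTQ record."""
    return sum(1 for _ in iter_fastq_lines(directory)) // 4
```

Because the files are streamed one after another, no intermediate combined FASTQ file has to be written to disk.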
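The arithmetic behind the new minor allele cutoff can be written out as follows. This is only a sketch of the stated assumption (1000 reads per sample, 0.5%); the release note fixes the cutoff at 5 reads, and the scaling by total read count, along with the names minor_allele_cutoff and is_minor_allele, is hypothetical.

```python
def minor_allele_cutoff(total_reads: int = 1000, per_mille: int = 5) -> int:
    """Return the read count at or below which an allele counts as minor.

    0.5% is expressed as 5 per mille so that the calculation stays in integers.
    """
    return total_reads * per_mille // 1000


def is_minor_allele(read_count: int, total_reads: int = 1000) -> bool:
    """Apply the 'less than or equal to the cutoff' rule from the release note."""
    return read_count <= minor_allele_cutoff(total_reads)
```

With the assumed 1000 reads, an allele supported by 5 reads is minor and one supported by 6 reads is not.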
Update
Updated preprocess.insertion_to_fasta: To make insertion alleles easier to distinguish, the reference for insertion alleles is now saved in the FASTA/HTML directory. Commit Detail
Updated insertions_to_fasta.extract_enriched_insertions: Previously, the presence ratio of insertion alleles was calculated separately for samples and controls and filtered at 0.5%. However, some control insertions narrowly missed the threshold and, as a result, were incorrectly identified as sample-specific insertions. To fix this, the algorithm now clusters samples and controls together and excludes clusters in which both types are mixed. This allows sample-specific insertion alleles to be extracted. Commit Detail
Updated the counting method in preprocess.insertions_to_fasta.count_insertions to treat similar insertions as identical. Previously, the same insertion was erroneously counted as several different insertions because of sequencing errors. Commit Detail
Updated preprocess.insertions_to_fasta.merge_similar_insertions: Previously, clustering was done using MiniBatchKMeans, but this method clustered excessively when only highly similar insertion sequences existed. Therefore, a strategy similar to that of extract_enriched_insertions was adopted: the algorithm now mixes in random scores drawn from a uniform distribution before clustering. Commit Detail
Added preprocess.insertions_to_fasta.clustering_insertions: The clustering methods used in extract_enriched_insertions and merge_similar_insertions were combined into a common function. Commit Detail
Moved the call_sequence function to the cssplits_handler module. Commit Detail
Bug Fixes
Fixed clustering.merge_labels so that minor labels are correctly reverted to their parent labels. Commit Detail
Updated utils.input_validator.validate_genome_and_fetch_urls to determine available_server more explicitly. Previously, it relied on HTTP response codes, but the UCSC Genome Browser sometimes returned a normal (200) response while internally being in an error state. The function now searches for specific keywords that appear in the normal HTML to determine whether the server is functioning correctly. Commit Detail
Added config.reset_logging to reset the logging configuration. Previously, when multiple experiment IDs (names) were processed in a batch, the log settings from previous experiments persisted and the log file name was not updated. With this change, a log file is now created for each experiment ID. Commit Detail
Fixed core.py: paths_predefined_fasta is now built from the user-entered ALLELE data. Previously, it was built from the FASTA files stored in the fasta directory. However, FASTA files left over from a previously aborted run (which included newly created insertions) were treated as predefined, so new insertions were incorrectly categorized as predefined. Commit Detail
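The mixed-cluster rule described for extract_enriched_insertions in the Update section can be sketched as below. This covers only the filtering step and assumes that a joint clustering of sample and control insertions has already produced one label per insertion; the function name sample_specific_items and its arguments are hypothetical and are not taken from the DAJIN2 source.

```python
from collections import defaultdict


def sample_specific_items(items, origins, labels):
    """Return the sample items whose cluster contains no control item.

    items   : insertion identifiers
    origins : "sample" or "control" for each item
    labels  : cluster label assigned to each item by a joint clustering step
    """
    # Record which origins occur in each cluster.
    origins_per_cluster = defaultdict(set)
    for origin, label in zip(origins, labels):
        origins_per_cluster[label].add(origin)

    # Keep sample items only from clusters made up purely of sample items.
    return [
        item
        for item, origin, label in zip(items, origins, labels)
        if origin == "sample" and origins_per_cluster[label] == {"sample"}
    ]
```

An insertion that shares a cluster with any control insertion is dropped regardless of how rare it is in the control, which avoids the borderline cases produced by a fixed 0.5% threshold.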
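The keyword-based server check described for validate_genome_and_fetch_urls can be illustrated without any network access. The sketch below assumes the caller has already fetched the status code and the HTML body; the function name looks_healthy and the example keyword are hypothetical, since the release note does not state which keywords are used.

```python
from typing import Iterable


def looks_healthy(status_code: int, html: str, keywords: Iterable[str]) -> bool:
    """Treat a server as available only if it answers 200 AND the page
    contains every keyword expected in a normal response."""
    if status_code != 200:
        return False
    return all(keyword in html for keyword in keywords)


# Hypothetical usage: a 200 response whose body is an error page is rejected.
normal_page = "<html><title>Genome Browser</title></html>"
error_page = "<html><h1>Internal error</h1></html>"
expected = ["Genome Browser"]  # assumed keyword, for illustration only
ok = looks_healthy(200, normal_page, expected)
broken = looks_healthy(200, error_page, expected)
```

Checking the body in addition to the status code is what distinguishes a working server from one that returns 200 while internally in error.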
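The logging fix can be pictured with Python's standard logging module: handlers left over from a previous experiment are detached before a new file handler is attached. This is a minimal sketch of that idea, not the code of config.reset_logging; the function name reset_file_logging, the log format, and the file naming are assumptions.

```python
import logging


def reset_file_logging(log_path: str) -> logging.Logger:
    """Detach every handler from the root logger, then attach one
    FileHandler that writes to `log_path`."""
    root = logging.getLogger()
    # Copy the list first because removeHandler mutates root.handlers.
    for handler in list(root.handlers):
        root.removeHandler(handler)
        handler.close()

    file_handler = logging.FileHandler(log_path, mode="a", encoding="utf-8")
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    root.addHandler(file_handler)
    root.setLevel(logging.INFO)
    return root
```

Calling the function once per experiment ID, with a different path each time, keeps the messages of each experiment in its own file.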