BioJulia / FASTX.jl

Parse and process FASTA and FASTQ formatted files of biological sequences.
https://biojulia.dev
MIT License

Transcode between FASTQ & FASTA files #53

Closed TransGirlCodes closed 3 years ago

TransGirlCodes commented 3 years ago

For issue #50

For a consistent conversion between HTS formats, we'll probably need a separate package with a more generic version of the transcode function I've implemented here, possibly built on convert and promote. For this example, though, a simple FASTA.Record constructor accepting a FASTQ.Record was enough.
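As a rough illustration of the constructor-based approach described above, here is a minimal, self-contained sketch. The types `FastqRecord` and `FastaRecord` and the functions `write_fasta` and `transcode_fastq_to_fasta` are hypothetical stand-ins written for this example. They are not the FASTX.jl types or the function added in this PR, and the reader assumes simple four-line FASTQ records.

```julia
# Hypothetical stand-ins for FASTQ.Record and FASTA.Record.
# The real FASTX.jl records store their data differently.
struct FastqRecord
    identifier::String
    description::String
    sequence::String
    quality::String
end

struct FastaRecord
    identifier::String
    description::String
    sequence::String
end

# Constructor-style conversion: keep header and sequence, drop the qualities.
FastaRecord(r::FastqRecord) = FastaRecord(r.identifier, r.description, r.sequence)

# Write one record in single-line FASTA form.
function write_fasta(io::IO, r::FastaRecord)
    header = isempty(r.description) ? r.identifier : string(r.identifier, ' ', r.description)
    println(io, '>', header)
    println(io, r.sequence)
end

# Read four-line FASTQ records from `input` and write them to `output` as FASTA.
# Returns the number of records converted.
function transcode_fastq_to_fasta(input::IO, output::IO)
    n = 0
    while !eof(input)
        header = readline(input)
        isempty(header) && continue          # skip blank lines between records
        startswith(header, "@") || error("expected '@' at the start of a FASTQ header")
        sequence = readline(input)
        readline(input)                      # '+' separator line, ignored
        quality = readline(input)
        parts = split(header[2:end], ' '; limit=2)
        description = length(parts) == 2 ? parts[2] : ""
        record = FastqRecord(parts[1], description, sequence, quality)
        write_fasta(output, FastaRecord(record))
        n += 1
    end
    return n
end
```

A more generic package could expose the same idea through `Base.convert` methods, so that each pair of formats only needs one conversion definition.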

jakobnissen commented 3 years ago

That sounds right. In general, I think data processing shouldn't be done on FASTA.Record or FASTQ.Record. These objects are not Julia native data structures: they're useful for IO, but not for more than that.

Transcoding between BAM/FASTQ/FASTA is a bit niche, but does require working directly on Records. So a package for BioHTSFormats or something would be a good idea.

codecov[bot] commented 3 years ago

Codecov Report

Merging #53 (1d76c18) into master (0d81f16) will increase coverage by 0.28%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #53      +/-   ##
==========================================
+ Coverage   83.60%   83.89%   +0.28%     
==========================================
  Files          11       12       +1     
  Lines         616      627      +11     
==========================================
+ Hits          515      526      +11     
  Misses        101      101              

| Flag | Coverage Δ |
|---|---|
| unittests | 83.89% <100.00%> (+0.28%) :arrow_up: |

| Impacted Files | Coverage Δ |
|---|---|
| src/FASTX.jl | 100.00% <100.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 0d81f16...1d76c18.

TransGirlCodes commented 3 years ago

> That sounds right. In general, I think data processing shouldn't be done on FASTA.Record or FASTQ.Record. These objects are not Julia native data structures: they're useful for IO, but not for more than that.

Yes, I largely agree, with some exceptions for file processing tasks where parsing fully into native structures is overkill: trimming, masking, converting and so on. These are simple enough for us to implement, so users can rely on the API for that limited set of common file operations and otherwise read into our native structures for serious analysis.
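To make the "limited set of operations" concrete, here is a small sketch of trimming and masking applied directly to a record-like value. The `SimpleFastq` type and the `trim_record` and `mask_low_quality` functions are hypothetical names invented for this example, not part of FASTX.jl, and the quality string is assumed to be Phred+33 encoded ASCII.

```julia
# Hypothetical record type for illustration only; not a FASTX.jl type.
struct SimpleFastq
    identifier::String
    sequence::String
    quality::String   # assumed Phred+33, one ASCII character per base
end

# Trimming: keep only the positions in `range`, slicing sequence and quality together.
trim_record(r::SimpleFastq, range::UnitRange{Int}) =
    SimpleFastq(r.identifier, r.sequence[range], r.quality[range])

# Masking: replace every base whose Phred score is below `min_phred` with 'N'.
# The quality string is left unchanged.
function mask_low_quality(r::SimpleFastq, min_phred::Integer)
    masked = map(collect(r.sequence), collect(r.quality)) do base, q
        Int(q) - 33 < min_phred ? 'N' : base
    end
    return SimpleFastq(r.identifier, String(masked), r.quality)
end
```

Both operations only touch the raw text of the record, which is why they do not need a fully parsed native sequence type.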

TransGirlCodes commented 3 years ago

OK, this looks like it is working fine, except for FASTQ files with empty sequences, I believe because FASTA parsing rejects them. So we may have to decide whether a record in a file is allowed to have no sequence, and ensure the answer is consistent between FASTA and FASTQ.
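One possible way to keep the two formats consistent is to route both through a single shared check, so the empty-sequence policy is decided in one place. This is only a sketch: `check_sequence`, `fasta_entry`, `fastq_entry` and the `allow_empty` keyword are hypothetical names for this example and do not exist in FASTX.jl.

```julia
# Hypothetical shared policy check; not part of FASTX.jl.
function check_sequence(sequence::AbstractString; allow_empty::Bool)
    if isempty(sequence) && !allow_empty
        throw(ArgumentError("record has an empty sequence"))
    end
    return sequence
end

# Both formatters call the same check, so they accept or reject
# empty sequences under exactly the same setting.
fasta_entry(id, seq; allow_empty::Bool=false) =
    string('>', id, '\n', check_sequence(seq; allow_empty=allow_empty), '\n')

fastq_entry(id, seq, qual; allow_empty::Bool=false) =
    string('@', id, '\n', check_sequence(seq; allow_empty=allow_empty), "\n+\n", qual, '\n')
```

Whichever default is chosen, applying it on both the FASTA and FASTQ side would mean that a file readable in one format can always be transcoded to the other.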