Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome

Forge BSgenome data package for NCBI assembly Felis_catus_9.0 #48

Status: Closed (kakopo closed 1 year ago)

kakopo commented 1 year ago

Contribution by Outreachy applicant kakopo. Added the fasta_to_sorted_2bit_for_Felis_catus_9.0.R script and the BSgenome.Fcatus.NCBI.9.0-seed file.

hpages commented 1 year ago

Thanks for the PR @kakopo. Looks good.

I would only suggest a very minor change to the seed file:

You're already providing this long URL:

 https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/181/335/GCF_000181335.3_Felis_catus_9.0/

in your SrcDataFiles field. So instead of repeating it in your source_url field, I would suggest that you provide the URL of the NCBI landing page for this assembly.

The assembly landing page is more informative: for example, it contains the release date and other important information about the assembly. It also provides the link to the "FTP directory for GenBank assembly", which is located at the long URL that you provide in your SrcDataFiles field. So from there, it's easy for anybody to find the sequence data.
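
As a rough sketch of this suggestion, the two fields in the seed file might end up looking like the fragment below. The landing-page URL and the FASTA file name are assumptions based on NCBI's usual naming for accession GCF_000181335.3. They are not copied from the actual seed file, so they should be checked against the assembly page before use.

```
source_url: https://www.ncbi.nlm.nih.gov/assembly/GCF_000181335.3/
SrcDataFiles: GCF_000181335.3_Felis_catus_9.0_genomic.fna.gz from
    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/181/335/GCF_000181335.3_Felis_catus_9.0/
```

With this split, source_url points readers to the human-readable description of the assembly, while SrcDataFiles still records exactly where the sequence data came from.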

The "How to forge a BSgenome data package" vignette had this for the source_url field:

The permanent URL where the sequence data files used to forge the target package can be found.

So yeah, that description was not really helpful, sorry! I just fixed it :smiley: (see commit 6444938f66450ddb6ef76e39bc20deb5e46dba37)

Other than that, I just used your fasta_to_sorted_2bit_for_Felis_catus_9.0.R script and BSgenome.Fcatus.NCBI.9.0-seed file to forge BSgenome.Fcatus.NCBI.9.0 on my laptop and everything went flawlessly. Congrats!

I know that you have questions in #43 about exactly how the code in the fasta_to_sorted_2bit_for_Felis_catus_9.0.R script works. I will answer them there.

kakopo commented 1 year ago

Done! Thank you!