BingbingYuan closed this issue 3 years ago
Hi Bingbing,
Unfortunately, the AGAPS and AMB masks are mandatory, sorry.
I've never had a user trying to forge a masked BSgenome package before so the documentation for that section in the vignette hasn't received as much attention as the section for BSgenome packages with bare sequences.
So in your case, you're going to have 3 masks per sequence: AGAPS, AMB, and RM.
This means you need to set nmask_per_seq to 3 in your seed file.
Let me know if that helps.
H.
PS: I'll be offline most of the time for the next week or so. Thanks for your patience.
Hi Hervé, Thanks for your reply! I was able to create .agp files with FASTA2AGP downloaded from https://gitlab.pasteur.fr/GIPhy/FASTA2AGP.
With nmask_per_seq=3, forgeMaskedBSgenomeDataPkg exited with the error below:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 290 did not have 15 elements
In addition: Warning message:
In .newEmptyMask(seqname, mask.width, mask.name, mask.desc, mask.desc) : No assembly gaps found for sequence "dd_Smes_g4_1" in this file. returning empty mask
But dd_Smes_g4_1 in the mask directory has gaps:
$ head -2 mask/dd_Smes_g4_1.agp
scaffold_001 1 376345 1 W contig_001 1 376345 +
scaffold_001 376346 376445 2 N 100 scaffold yes paired-ends
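The scan() error above reports a line with an unexpected number of fields, but not which input file it came from. One way to narrow that down is sketched below in Python, assuming the mask files are plain text and tab- or whitespace-delimited. It tallies the number of fields on each non-comment line and collects the distinct names found in the first column, which for an .agp file can then be compared by eye with the sequence name in the warning (dd_Smes_g4_1). The function name field_report is hypothetical and not part of BSgenome, IRanges, or FASTA2AGP. The first two sample lines are the ones shown above, with tab separators assumed; the third is a made-up truncated line for illustration.

```python
from collections import Counter


def field_report(lines, sep=None):
    """Tally field counts per line and collect first-column names.

    sep=None splits on any run of whitespace; pass "\t" for strict tabs.
    Blank lines and lines starting with '#' are skipped.
    """
    counts = Counter()        # number of fields -> how many lines have it
    first_column = set()      # distinct names seen in column 1
    lines_by_count = {}       # number of fields -> 1-based line numbers
    for lineno, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if not stripped.strip() or stripped.startswith("#"):
            continue
        fields = stripped.split(sep)
        counts[len(fields)] += 1
        first_column.add(fields[0])
        lines_by_count.setdefault(len(fields), []).append(lineno)
    return counts, first_column, lines_by_count


sample = [
    "scaffold_001\t1\t376345\t1\tW\tcontig_001\t1\t376345\t+\n",
    "scaffold_001\t376346\t376445\t2\tN\t100\tscaffold\tyes\tpaired-ends\n",
    "scaffold_001\t376446\t400000\t3\tW\n",  # hypothetical truncated line
]
counts, names, where = field_report(sample, sep="\t")
print(dict(counts))   # field-count tally
print(sorted(names))  # names in column 1
print(where)          # which lines have which field count
```

To run it on a real file, pass an open file handle instead of the sample list, for example `with open("mask/dd_Smes_g4_1.agp") as fh: field_report(fh, sep="\t")`. The same function can be pointed at the RepeatMasker output file, since the error message alone does not say which file was being read.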
When I reduced nmask_per_seq to 2, forgeMaskedBSgenomeDataPkg worked fine, and I was able to create the masked package.
Since the BSgenome package mentions "You don’t need any file for the AMB masks", I don't have any AMB mask files.
Thanks, Bingbing
Hi Bingbing,
Sorry it's been a long time since your post above. forgeMaskedBSgenomeDataPkg() uses IRanges::read.agpMask() internally to parse the .agp files and it seems that the function didn't like some of the files produced by FASTA2AGP. Unfortunately, without having access to these files it's impossible for me to tell what went wrong exactly. If you'd like me to take a look please make the files available. If not, then I'll close this issue.
Thanks and sorry again for getting back to you only now.
H.
Hi Hervé, I was able to build a masked BSgenome package with .agp files created with FASTA2AGP. When I checked the package with "R CMD check", I only got one warning message related to "LaTeX errors when creating PDF version", and no warning or error message related to the .agp files. You mentioned that forgeMaskedBSgenomeDataPkg didn't like some files produced by FASTA2AGP. What error message did you see? I replied to this email today with my seed configuration file and the content of masks_srcdir, but it didn't get delivered because of the file size. In this email, I only include one .agp file as an example. Thanks,
On Tue, Dec 8, 2020 at 3:31 PM Bingbing Yuan byuan@wi.mit.edu wrote:
Hi Hervé, I was able to build a masked BSgenome package with .agp files created with FASTA2AGP. When I checked the package with "R CMD check", I only got one warning message related to "LaTeX errors when creating PDF version", and no warning or error message related to the .agp files. I'm not sure what to look for. I attached the sequences (mask_seq.tar.gz) and the seed file (BSgenome.Smed.PlanMine.smedg4.masked-seed) in this email. Thanks,
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/Bioconductor/BSgenome/issues/10#issuecomment-740517136, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACI44XSQ25VSB3XJIKUPDM3STX2H5ANCNFSM4PCS2VZA .
-- Bingbing
You mentioned that forgeMaskedBSgenomeDataPkg didn't like some files produced by FASTA2AGP. What error message did you see?
This is only based on the error you reported in your earlier comment (from Jul 23). I don't see this error myself when I run IRanges::read.agpMask() on the .agp files that I have access to. As I said I could try to take a closer look but for this I would need to be able to reproduce the error that you got, and for this I would need access to the .agp file that caused problems for you (dd_Smes_g4_1.agp). Thanks!
H.
P.S.: Using the email interface for these discussions works but is not really convenient. It's usually easier to discuss via the web interface, especially for a discussion that spans several months like this one.
Thanks for the quick response! That error message occurred only when I set "nmask_per_seq" to 3. When I set "nmask_per_seq" to 2, forgeMaskedBSgenomeDataPkg ran without any error. Since the vignette says "You don’t need any file for the AMB masks.", I assume that nmask_per_seq should be 2 (I have .agp and repeat-masked files). Does this make sense to you? Thanks,
-- Bingbing
I don't know. But if you're saying that everything worked as expected and that there's no need for me to try to troubleshoot the problems you got when setting nmask_per_seq to 3, then we should just close this issue.
H.
I was able to create the first package on bare sequences. Thanks for the well-documented instructions! I have a problem when creating a masked package with a RepeatMasker fa.out file. I got the error below when running forgeMaskedBSgenomeDataPkg, which suggests missing gap files (related to the AGAPS masks). How can I tell the program to create a package based only on RM? Did I miss a parameter to specify "RM" to the program? I'm using BSgenome_1.54.0.
Here is the content of my seed file:
Package: BSgenome.Smed.PlanMine.smedg4.masked
Title: Full masked genome sequences for planaria (PlanMine version g4)
Description: Full genome sequences for planaria was downloaded from PlanMine (2018). The sequences are the same as in BSgenome.Smed.PlanMine.smedg4, except that it has the mask of repeats from RepeatMasker (RM mask).
Version: 0.1
RefPkgname: BSgenome.Smed.PlanMine.smedg4
organism_biocview: Schmidtea_mediterranea
nmask_per_seq: 1
SrcDataFiles: RM masks: genome.fa.out
masks_srcdir: /mypath/mask
RMfiles_name: genome.fa.out
Here is the error message:
forgeMaskedBSgenomeDataPkg("/mypath/BSgenome.Smed.PlanMine.smedg4.masked-seed")
Creating package in /mypath/BSgenome.Smed.PlanMine.smedg4.masked
Error in .guessGapFileCOL2CLASS(file) : unable to guess the column names in "gap" file '/mypath/mask/dd_Smes_g4_1_gap.txt', sorry
In addition: Warning messages:
1: In file(file, "rt") : cannot open file '/mypath/mask/dd_Smes_g4_1_gap.txt': No such file or directory
2: In file(file, "rt") : cannot open file '/mypath/mask/dd_Smes_g4_1_gap.txt': No such file or directory
Thanks, Bingbing
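The error above shows the forge step looking for a per-sequence file named dd_Smes_g4_1_gap.txt under masks_srcdir. A small Python sketch for listing which such files are absent before running the forge follows. The <masks_dir>/<seqname><suffix> naming pattern is inferred only from the path in that error message, not from the BSgenome documentation, and missing_mask_files, seqA, and seqB are hypothetical names used for illustration.

```python
import os
import tempfile


def missing_mask_files(masks_dir, seqnames, suffix):
    """Return the expected paths <masks_dir>/<seqname><suffix> that do not exist."""
    expected = [os.path.join(masks_dir, name + suffix) for name in seqnames]
    return [path for path in expected if not os.path.isfile(path)]


# Self-contained demo in a throwaway directory: only seqA has a gap file.
with tempfile.TemporaryDirectory() as masks_dir:
    open(os.path.join(masks_dir, "seqA_gap.txt"), "w").close()
    missing = missing_mask_files(masks_dir, ["seqA", "seqB"], "_gap.txt")
    missing_names = [os.path.basename(path) for path in missing]
    print(missing_names)
```

In practice the sequence names would come from the reference package and masks_dir would be the masks_srcdir value from the seed file.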