Proposed contribution task for Outreachy applicants: Register UCSC genome felCat9

Bioconductor / GenomeInfoDb

Utilities for manipulating chromosome names, including modifying them to follow a particular naming style

https://bioconductor.org/packages/GenomeInfoDb


Proposed contribution task for Outreachy applicants: Register UCSC genome felCat9 #49

Closed hpages closed 2 years ago

hpages commented 2 years ago

felCat9 is the latest UCSC genome for Cat (Felis catus). See "List of UCSC genome releases" at https://genome.ucsc.edu/FAQ/FAQreleases.html for all the genomes currently supported by UCSC.

Also check out the "Genome Browser Gateway" page here. This is the main entrance to the "UCSC Genome Browser". Find Cat in the UCSC species tree on the left, click on it, then make sure to select the latest Cat Assembly (felCat9). This will display a bunch of additional information about the felCat9 assembly.

Note that many UCSC genomes are already registered in the GenomeInfoDb package (83 as of October 2022). The registered_UCSC_genomes() function in GenomeInfoDb returns the list of all the UCSC genomes that are currently registered in the package. An important thing to be aware of is that getChromInfoFromUCSC() still works on an unregistered genome, but in "degraded" mode, that is:

the assembled.molecules argument is ignored,
the assembled and circular columns of the returned data.frame are filled with NAs,
and the chromosomes/sequences are not returned in any particular order.

Registering a genome fixes that. In other words, once a genome is registered in GenomeInfoDb, the information returned by getChromInfoFromUCSC() for that genome is guaranteed to be complete and accurate.

See ?getChromInfoFromUCSC (after loading GenomeInfoDb) for more information.
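
A quick way to see whether a genome is already registered is to look it up in the table returned by registered_UCSC_genomes(). The sketch below is only illustrative: it assumes that the table has a `genome` column, and it runs on a small hand-made stand-in table (`toy_registered`, hypothetical) so that it needs neither the package nor network access.

```r
## Hypothetical helper: TRUE if 'genome' appears in the 'genome' column
## of a table shaped like the one returned by registered_UCSC_genomes().
## (The column name 'genome' is an assumption.)
is_registered <- function(genome, registered)
{
    stopifnot(is.data.frame(registered),
              "genome" %in% colnames(registered))
    genome %in% registered$genome
}

## Hand-made stand-in for the real table (values for illustration only):
toy_registered <- data.frame(
    organism=c("Homo sapiens", "Gallus gallus"),
    genome=c("hg38", "galGal5")
)

is_registered("hg38", toy_registered)     # TRUE
is_registered("felCat9", toy_registered)  # FALSE
```

With GenomeInfoDb loaded, the corresponding call would be is_registered("felCat9", registered_UCSC_genomes()), provided the column name assumption holds.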

Registering a new UCSC genome is only a matter of adding a new file, called "registration file", to GenomeInfoDb/inst/registered/UCSC_genomes/. Note that the folder contains a README.TXT file that provides some brief information about what a "registration file" should contain (unfortunately the registration process is not fully documented).

For felCat9, since this is the first felCat genome that we're going to register in GenomeInfoDb, we need to start the felCat9.R file from scratch. However, looking at other registration files to get a feeling of how things are done is always a good idea. Don't bother with the NCBI_LINKER component for now. We'll add it later, once the corresponding NCBI assembly (Felis_catus_9.0) is also registered (registering Felis_catus_9.0 is the topic of issue #50).

IMPORTANT NOTES TO OUTREACHY APPLICANTS:

Make sure to complete all the Preliminary tasks listed here before you start working on this task. In particular, make sure that you have R 4.2 and that you are set up to use the devel version of Bioconductor (currently 3.16).
Only one applicant can work on this task. If you choose to work on this task, please make sure to assign yourself so other applicants know that the task is already being worked on. If later on you change your mind, please unassign yourself. It's ok to change your mind!
To work on this task, please fork the GenomeInfoDb repository. Then do your work on that fork.
Always test your changes before you commit them to your fork. This consists in installing the modified package, starting R, loading the package, and playing around with the new functionality. This process is called "ad hoc manual testing". Once everything behaves and looks as expected, run R CMD build and R CMD check on the package. Note that R CMD check should always be run on the source tarball produced by R CMD build.
R CMD check might produce some NOTEs and even some WARNINGs. These are ok if they existed before your changes. You can check that by taking a look at the daily report produced by our automated builds here: https://bioconductor.org/checkResults/devel/bioc-LATEST/ Make sure to not introduce new NOTEs or WARNINGs!
Once your work is ready to be merged, please submit a PR (Pull Request).
Remember to record your contribution on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

kakopo commented 2 years ago

Thank you @hpages. I would love to volunteer for this task

hpages commented 2 years ago

Excellent. Thank you!

hpages commented 2 years ago

@kakopo Here is some important information about genomes, genome assemblies, and the UCSC sequence naming scheme that will hopefully get you started.

About chromosomes and UCSC chromosome names

Each organism has its own set of chromosomes.

For example the chromosomes for Human are numbered from 1 to 22 (not counting the sex and mitochondrial chromosomes). The names that the UCSC people use for those chromosomes are chr1 to chr22. This is what you're going to see in all the hg* genomes at UCSC. Note that the hg* genomes are different versions of the Human genome, that is, versions that correspond to different assemblies that have been improved over time.

For Dog (canFam* genomes), the chromosomes are numbered from 1 to 38 (not counting the sex and mitochondrial chromosomes), and the UCSC people have named them chr1 to chr38.

For Worm (ce* genomes), there are only 5 chromosomes (again not counting the sex and mitochondrial chromosomes), and the UCSC people have named them chrI, chrII, chrIII, chrIV, and chrV (note the Roman numerals).

For Fly (dm* genomes), the UCSC chromosome names are chr2L, chr2R, chr3L, chr3R, and chr4.

As you can see, the number of chromosomes and chromosome naming scheme can vary a lot between organisms!

You can see the chromosome names for a given UCSC genome by calling getChromInfoFromUCSC() on it. For example:

library(GenomeInfoDb)

chrominfo <- getChromInfoFromUCSC("hg38", assembled.molecules.only=TRUE)
chrominfo$chrom  # chromosome names for Human
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7" "chr8"  "chr9"
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chr19" "chr20" "chr21" "chr22" "chrX"  "chrY"  "chrM"

chrominfo <- getChromInfoFromUCSC("ce11", assembled.molecules.only=TRUE)
chrominfo$chrom  # chromosome names for Worm
# [1] "chrI"   "chrII"  "chrIII" "chrIV"  "chrV"   "chrX"   "chrM"

chrominfo <- getChromInfoFromUCSC("dm6", assembled.molecules.only=TRUE)
chrominfo$chrom
# [1] "chr2L" "chr2R" "chr3L" "chr3R" "chr4"  "chrX"  "chrY" "chrM"

Note that using assembled.molecules.only=TRUE is not strictly required. But if you don't use it, then the function will return a data frame with all the sequences that are in this particular version of the genome, not just the chromosomes. For example, for hg38, it will return a data frame with 640 rows instead of 25. For dm6, it will be 1870 rows instead of 8.

When using assembled.molecules.only=TRUE, only the "assembled molecules" are returned. That means: autosomes (see https://en.wikipedia.org/wiki/Autosome), sex chromosome(s) (i.e. chromosome X and/or Y), and mitochondrial chromosome. All the other sequences are usually "scaffolds", that is, they are small DNA fragments that were produced by the assembly process but the scientists working on this assembly were not able to merge those fragments into the chromosome sequences yet.

So your first goal is to find out what chromosomes are in the felCat9 genome. Note that you can use getChromInfoFromUCSC() for that, but, because felCat9 is not a registered genome yet, using assembled.molecules.only=TRUE for this genome doesn't work:

chrominfo <- getChromInfoFromUCSC("felCat9", assembled.molecules.only=TRUE)
# Warning message:
# In .get_chrom_info_for_unregistered_UCSC_genome(genome, assembled.molecules.only = assembled.molecules.only,  :
#   'assembled.molecules' got ignored for unregistered UCSC genome felCat9
#   (don't know what the assembled molecules are for an unregistered UCSC
#   genome)

This means that we need to look at the full chrominfo$chrom vector, which is a character vector of 4508 sequence names! And we need to find the chromosome names in it. It might sound somewhat tedious, but it's actually not going to be too bad if we know a little bit about the UCSC sequence naming scheme. Two important things to know about this naming scheme:

UCSC uses sequence names made of several parts separated by underscores.
For chromosomes, they use names made of a single part.

They have other rules that we'll discuss later, but those two are the most important ones and they should help get you started.

Once you've identified the chromosome names for felCat9 (either visually or programmatically), another difficulty is to decide in what order to put those names in ASSEMBLED_MOLECULES. Like you, I'm new to the Cat genome, so I took a quick look at chrominfo$chrom, and I saw the following:

table(lengths(strsplit(chrominfo$chrom, split="_")))
#    1    3    4
#   20 4142  346

This tells me that felCat9 contains 20 chromosomes and 4488 scaffolds (4142+346). To see the chromosome names, you can do something like this:

idx1 <- which(lengths(strsplit(chrominfo$chrom, split="_")) == 1L)
chrominfo$chrom[idx1]

The order we usually want to follow is: autosomes first, then sex chromosome(s), then mitochondrial chromosome.
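
As a minimal sketch of that ordering rule, the example below reorders a small, made-up set of chromosome names (a hypothetical organism, not felCat9) with match(). This is the same idiom that the registration files use to put the assembled molecules in the order given by ASSEMBLED_MOLECULES.

```r
## Hypothetical, unordered chromosome names (not from any real genome):
chroms <- c("chrM", "chrX", "chr2", "chr1", "chrY", "chr3")

## Target order: autosomes first, then sex chromosomes, then mitochondrial.
target <- c(paste0("chr", 1:3), "chrX", "chrY", "chrM")

## Every name must be accounted for, otherwise the target order is wrong.
stopifnot(setequal(chroms, target))

## match() gives the position in 'chroms' of each name in 'target'.
oo <- match(target, chroms)
stopifnot(!anyNA(oo))
chroms[oo]
# [1] "chr1" "chr2" "chr3" "chrX" "chrY" "chrM"
```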

Once you have figured these things out, you should be able to set ASSEMBLED_MOLECULES in felCat9.R.

Let me know how this is going and do not hesitate to ask questions. We will need to discuss more things about genomes, genome assemblies, and UCSC sequence naming scheme. I'll do my best to help.

kakopo commented 2 years ago

Thank you so much @hpages. This really helped me, and has been extremely enlightening on so many levels!

kakopo commented 2 years ago

I have been able to run most of the tests on my script successfully, save for executing FETCH_ORDERED_CHROM_SIZES(), which still doesn't return the expected dataframe. Finally, running R CMD build GenomeInfoDb produces the following error via cmd

* checking for file ‘GenomeInfoDb/DESCRIPTION’ ... OK
* preparing ‘GenomeInfoDb’:
* checking DESCRIPTION meta-information ... OK
Warning in file(con, "r") :
  cannot open file 'man': No such file or directory
 ERROR
computing Rd index failed:cannot open the connection

The same command worked fine on the BSgenome file so I'm unsure as to what might have gone wrong

kakopo commented 2 years ago

This is the current state of my code, and nothing else in the package file has been tampered with

ORGANISM <- "Felis catus"
ASSEMBLED_MOLECULES <- paste0("chr", c(1:18, "X", "M"))
CIRC_SEQS <- "chrM"

library(IRanges)       # for CharacterList()
library(GenomeInfoDb)  # for fetch_chrom_sizes_from_UCSC()

.order_seqlevels <- function(seqlevels)
{
  tmp <- CharacterList(strsplit(seqlevels, "_"))
  npart <- lengths(tmp)
  stopifnot(all(npart %in% c(1L, 3L, 4L)))

  idx1 <- which(npart == 1L)
  stopifnot(length(idx1) == length(ASSEMBLED_MOLECULES))
  oo1 <- match(ASSEMBLED_MOLECULES, seqlevels[idx1])
  stopifnot(!anyNA(oo1))
  idx1 <- idx1[oo1]

  idx3 <- which(npart == 3L)
  m3 <- matrix(unlist(tmp[idx3]), ncol=3L, byrow=TRUE)
  stopifnot(all(m3[ , 1L] == "chrUn"))
  oo3 <- order(m3[ , 3L])
  idx3 <- idx3[oo3]

  idx4 <- which(npart == 4L)
  m4 <- matrix(unlist(tmp[idx4]), ncol=4L, byrow=TRUE)
  stopifnot(all(m4[ , 1L] == "random"))
  oo4 <- order(m4[ , 4L])
  idx4 <- idx4[oo4]

  c(idx1, idx3, idx4)
}

FETCH_ORDERED_CHROM_SIZES <-
  function(goldenPath.url=getOption("UCSC.goldenPath.url"))
  {
    chrom_sizes <- GenomeInfoDb:::fetch_chrom_sizes_from_UCSC(GENOME,
                                                              goldenPath.url=goldenPath.url)
    oo <- .order_seqlevels(chrom_sizes[ , "chrom"])
    S4Vectors:::extract_data_frame_rows(chrom_sizes, oo)
  }

hpages commented 2 years ago

> Thank you so much @hpages. This really helped me, and has been extremely enlightening on so many levels!

Glad it helped!

Please edit your last comment and replace the single backticks with triple backticks so that your code gets properly displayed. Single backticks are for inline code. See here for how to use triple backticks for code blocks in Markdown.

About this (from your code above):

ASSEMBLED_MOLECULES <- paste0("chr", c(1:18, "X", "M"))

This produces the following ASSEMBLED_MOLECULES vector:

ASSEMBLED_MOLECULES
#  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
# [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
# [19] "chrX"  "chrM"

Are you sure those are the chromosome names in the felCat9 genome?

About this:

> The same command worked fine on the BSgenome file so I'm unsure as to what might have gone wrong.

Not sure what a "BSgenome file" is. Also it's too early to try to run R CMD build GenomeInfoDb. The R CMD build and R CMD check steps are usually the very last steps before a commit. Before we run them, we need to validate our changes via some "ad hoc manual testing".

kakopo commented 2 years ago

@hpages I think I am finally getting the hang of things. Here is the edited code to fix the ASSEMBLED_MOLECULES bit

GENOME <- "felCat9"
ORGANISM <- "Felis catus"
ASSEMBLED_MOLECULES <- paste0("chr",
                              c("A1", "A2", "A3", "B1", "B2", "B3", "B4", "C1", "C2", "D1", "D2", "D3", "D4", "E1", "E2", "E3", "F1", "F2", "X", "M"))
CIRC_SEQS <- "chrM"

library(IRanges)       # for CharacterList()
library(GenomeInfoDb)  # for fetch_chrom_sizes_from_UCSC()

.order_seqlevels <- function(seqlevels)
{
  tmp <- CharacterList(strsplit(seqlevels, "_"))
  npart <- lengths(tmp)
  stopifnot(all(npart %in% c(1L, 3L, 4L)))

  idx1 <- which(npart == 1L)
  stopifnot(length(idx1) == length(ASSEMBLED_MOLECULES))
  oo1 <- match(ASSEMBLED_MOLECULES, seqlevels[idx1])
  stopifnot(!anyNA(oo1))
  idx1 <- idx1[oo1]

  idx3 <- which(npart == 3L)
  m3 <- matrix(unlist(tmp[idx3]), ncol=3L, byrow=TRUE)
  stopifnot(all(m3[ , 1L] == "chrUn"))
  oo3 <- order(m3[ , 3L])
  idx3 <- idx3[oo3]

  idx4 <- which(npart == 4L)
  m4 <- matrix(unlist(tmp[idx4]), ncol=4L, byrow=TRUE)
  stopifnot(all(m4[ , 4L] == "random"))
  oo4 <- order(m4[ , 4L])
  idx4 <- idx4[oo4]

  c(idx1, idx3, idx4)
}

FETCH_ORDERED_CHROM_SIZES <-
  function(goldenPath.url=getOption("UCSC.goldenPath.url"))
  {
    chrom_sizes <- GenomeInfoDb:::fetch_chrom_sizes_from_UCSC(GENOME,
                                                              goldenPath.url=goldenPath.url)
    oo <- .order_seqlevels(chrom_sizes[ , "chrom"])
    S4Vectors:::extract_data_frame_rows(chrom_sizes, oo)
  }

I have successfully run the tests referenced in the README.TXT file as well. The BSgenome file I am referencing is the one listed here https://github.com/Bioconductor/BSgenomeForge/wiki/List-of-contribution-tasks-for-the-Outreachy-application-period which helped me get familiar with the R CMD build and R CMD check command

hpages commented 2 years ago

Hi @kakopo ,

> Here is the edited code to fix the ASSEMBLED_MOLECULES bit

That's it! Glad to see that you are making good progress.

Also it's great that you were able to come up with a working FETCH_ORDERED_CHROM_SIZES() function. Now we need to discuss the order in which the sequences are returned. This is a delicate topic and I will try to provide as many details as I can in the next comment below.

> The BSgenome file I am referencing is the one listed here ...

I guess you meant the BSgenome package, not file. I see now. BTW the link you provide in your comment above doesn't work. Can you please fix it? Thanks

hpages commented 2 years ago

@kakopo

Let's take a closer look at all the sequences in felCat9:

library(GenomeInfoDb)

## Note that we sometimes use "seqlevels" as a synonym for "sequence names".
seqlevels <- getChromInfoFromUCSC("felCat9")$chrom

## Break the seqlevels (a.k.a. sequence names) in parts and count the number of parts in each seqlevel.
## The result 'npart' is an integer vector parallel to 'seqlevels' (i.e. same length, and the i-th element
## in 'npart' corresponds to the i-th element in 'seqlevels').
npart <- lengths(strsplit(seqlevels, "_"))

table(npart)
# npart
#    1    3    4 
#   20 4142  346

So, based on the number of parts in the sequence names, we can distinguish 3 groups of sequences. We already know that the first group contains all the so-called "assembled molecules". Now let's take a look at the 2 other groups.

## seqlevels made of 3 parts:
idx3 <- which(npart == 3L)
head(seqlevels[idx3])
# [1] "chrUn_NW_019365585v1" "chrUn_NW_019365586v1" "chrUn_NW_019365587v1"
# [4] "chrUn_NW_019365588v1" "chrUn_NW_019365589v1" "chrUn_NW_019365590v1"

## seqlevels made of 4 parts:
idx4 <- which(npart == 4L)
head(seqlevels[idx4])
# [1] "chrX_NW_019365559v1_random" "chrX_NW_019365560v1_random"
# [3] "chrX_NW_019365561v1_random" "chrX_NW_019365562v1_random"
# [5] "chrX_NW_019365563v1_random" "chrX_NW_019365564v1_random"

3-part seqlevels: It seems that the seqlevels made of 3 parts are of the form chrUn_NW_xxxxxxx. This observation is based on the first 6 seqlevels in the group so we don't know if this applies to all the seqlevels in the group, but let's assume that this is the case for now (we'll confirm this later). The sequences in this group are called "unplaced scaffolds". Those are small DNA fragments that were produced by the assembly process, but the scientists working on this assembly were not able to identify which chromosome those fragments are coming from. UCSC naming convention for "unplaced scaffolds" is to prefix their names with chrUn_.

4-part seqlevels: It seems the seqlevels made of 4 parts are of the form <chromosome-name>_NW_xxxxxxx_random. Again, this observation is based on the first 6 seqlevels in the group so we don't know if this applies to all the seqlevels in the group, but let's assume that this is the case for now (we'll confirm this later). The sequences in this group are called "unlocalized scaffolds". Those are very similar to "unplaced scaffolds", that is, they are also small DNA fragments that were produced by the assembly process. However, for "unlocalized scaffolds", the scientists working on this assembly were actually able to identify which chromosome those fragments are coming from, but they are not sure exactly from which location on the chromosome. UCSC naming convention for "unlocalized scaffolds" is to prefix their names with the corresponding chromosome name and to suffix them with _random.
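
One possible way to confirm those two assumptions programmatically is to test every name in each group against a pattern instead of only looking at head(). The sketch below runs on a handful of names copied from the output above, so it needs no network access. The regular expressions, including the trailing version suffix (the v1 part), are assumptions based on those few names only.

```r
## A small sample of felCat9 seqlevels taken from the output shown above:
seqlevels <- c("chrA1", "chrX", "chrM",
               "chrUn_NW_019365585v1", "chrUn_NW_019365586v1",
               "chrX_NW_019365559v1_random", "chrX_NW_019365560v1_random")

npart <- lengths(strsplit(seqlevels, "_", fixed=TRUE))

unplaced    <- seqlevels[npart == 3L]   # expected form: chrUn_NW_xxxxxxx
unlocalized <- seqlevels[npart == 4L]   # expected form: <chrom>_NW_xxxxxxx_random

## Assumed patterns (inferred from the sample, not from UCSC documentation):
unplaced_ok    <- grepl("^chrUn_NW_[0-9]+v[0-9]+$", unplaced)
unlocalized_ok <- grepl("^chr[^_]+_NW_[0-9]+v[0-9]+_random$", unlocalized)

## Fail loudly if any name in a group does not have the expected form.
stopifnot(all(unplaced_ok), all(unlocalized_ok))
```

On the real genome, the same checks would be applied to getChromInfoFromUCSC("felCat9")$chrom instead of the sample vector.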

Back to FETCH_ORDERED_CHROM_SIZES(): The function should return the sequences in the following order:

Assembled molecules
Unlocalized scaffolds (<chromosome-name>_NW_xxxxxxx_random)
Unplaced scaffolds (chrUn_NW_xxxxxxx)

Right now your FETCH_ORDERED_CHROM_SIZES() function returns the unlocalized scaffolds after the unplaced scaffolds. So you need to make the necessary change to return them before the unplaced scaffolds. Note that all the "ordering" work is actually performed by the .order_seqlevels() function so this is where you want to make the change. It should be a very simple change.

Then we'll discuss the order of the seqlevels within each group.

As always, make sure to let me know if you have questions.

kakopo commented 2 years ago

@hpages I've edited the link, and yes, I did mean package, not file. Sorry for the mix up! Below is the current .order_seqlevels() function, which allows the unlocalized scaffolds to come before the unplaced scaffolds


.order_seqlevels <- function(seqlevels)
{
  tmp <- CharacterList(strsplit(seqlevels, "_"))
  npart <- lengths(tmp)
  stopifnot(all(npart %in% c(1L, 3L, 4L)))

  idx1 <- which(npart == 1L)
  stopifnot(length(idx1) == length(ASSEMBLED_MOLECULES))
  oo1 <- match(ASSEMBLED_MOLECULES, seqlevels[idx1])
  stopifnot(!anyNA(oo1))
  idx1 <- idx1[oo1]

  idx4 <- which(npart == 4L)
  m4 <- matrix(unlist(tmp[idx4]), ncol=4L, byrow=TRUE)
  stopifnot(all(m4[ , 4L] == "random"))
  oo4 <- order(m4[ , 4L])
  idx4 <- idx4[oo4]

  idx3 <- which(npart == 3L)
  m3 <- matrix(unlist(tmp[idx3]), ncol=3L, byrow=TRUE)
  stopifnot(all(m3[ , 1L] == "chrUn"))
  oo3 <- order(m3[ , 3L])
  idx3 <- idx3[oo3]

  c(idx1, idx4, idx3)
}

hpages commented 2 years ago

Great! Really good progress here.

Let's now discuss the order of the seqlevels within each group.

Unlocalized scaffolds (<chromosome-name>_NW_xxxxxxx_random) should be ordered first by chromosome (part 1), then by xxxxxxx (part 3).
Unplaced scaffolds (chrUn_NW_xxxxxxx) should be ordered by xxxxxxx (part 3).

Let's start with the unplaced scaffolds, since they seem to be a little bit easier to deal with:

The good news is that you actually have this right :+1:

These 2 lines:

oo3 <- order(m3[ , 3L])
idx3 <- idx3[oo3]

are actually taking care of putting the idx3 index in an order that follows the lexicographic order of part 3 (the last part in the seqlevels).

You also have a sanity check:

stopifnot(all(m3[ , 1L] == "chrUn"))

This makes sure that, for all the seqlevels in this group, the first part in the seqlevel is chrUn. I would suggest that you do the same for the middle part, that is, that you add another sanity check that makes sure that the middle part is NW for all the seqlevels in the group. Remember how we assumed that all the seqlevels in the group are of the form chrUn_NW_xxxxxxx? With those two sanity checks in place, we'll make sure that this is actually the case.
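
For illustration, here is a minimal sketch of the two sanity checks followed by the ordering step, run on two unplaced scaffold names taken from the output above (deliberately put in reverse order). It uses a plain list from strsplit() rather than a CharacterList so that it runs in base R; the logic is otherwise the same as in .order_seqlevels().

```r
## Two unplaced scaffolds from felCat9, in reverse order on purpose:
seqlevels3 <- c("chrUn_NW_019365586v1", "chrUn_NW_019365585v1")

## One row per seqlevel, one column per part:
m3 <- matrix(unlist(strsplit(seqlevels3, "_", fixed=TRUE)),
             ncol=3L, byrow=TRUE)

## Sanity checks: part 1 must be "chrUn" and part 2 must be "NW".
stopifnot(all(m3[ , 1L] == "chrUn"))
stopifnot(all(m3[ , 2L] == "NW"))

## Order by part 3 (the xxxxxxx part):
oo3 <- order(m3[ , 3L])
seqlevels3[oo3]
# [1] "chrUn_NW_019365585v1" "chrUn_NW_019365586v1"
```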

Now let's take care of the unlocalized scaffolds:

This is becoming a little bit tricky. So instead of trying to explain here how to write the code that will put the unlocalized scaffolds in the right order, I'll point you to this registration file where this problem has been solved already. Note that for the galGal5 genome, they use NT instead of NW in the seqlevels (for part 2), so you will need to adapt the code.

Once you are satisfied with your .order_seqlevels() function and your felCat9.R file in general, please perform some basic testing as explained in the README.TXT file in the GenomeInfoDb/inst/registered/UCSC_genomes/ folder.

If you're confident that everything looks good, then please proceed with the R CMD build and R CMD check steps.

If that goes as expected, then commit your work and submit a PR. Don't forget to add the felCat9.R file with git add felCat9.R before you commit.

Thanks!

kakopo commented 2 years ago

Thank you. I've carried out all the necessary edits and tests, and everything looks good. I am curious as to the role of this line m41 <- match(m4[ , 1L], ASSEMBLED_MOLECULES) in the code - it seemed to function well without it, but I'm sure it's important nonetheless. My current issue is that R CMD build GenomeInfoDb has simply failed to work as expected. This is its output via terminal

* checking for file ‘GenomeInfoDb/DESCRIPTION’ ... OK
* preparing ‘GenomeInfoDb’:
* checking DESCRIPTION meta-information ... OK
Warning in file(con, "r") :
  cannot open file 'man': No such file or directory
 ERROR
computing Rd index failed:cannot open the connection

which has been quite weird as R CMD build worked on the other packages. None of the searches I've carried out seem to address this issue in regards to this command either

hpages commented 2 years ago

> I am curious as to the role of this line m41 <- match(m4[ , 1L], ASSEMBLED_MOLECULES) in the code - it seemed to function well without it, but I'm sure it's important nonetheless.

That's a great question! I'll try to explain.

We could do:

idx4 <- which(npart == 4L)
m4 <- matrix(unlist(tmp[idx4]), ncol=4L, byrow=TRUE)
stopifnot(all(m4[ , 2L] == "NW"))
stopifnot(all(m4[ , 4L] == "random"))
oo4 <- order(m4[ , 1L], m4[ , 3L])
idx4 <- idx4[oo4]

and maybe that seems to work well. Note that m4[ , 1L] is a character vector containing chromosome names (part 1 of the seqlevels), and m4[ , 3L] is another character vector containing part 3 of the seqlevels (the xxxxxxx part). So when we compute the ordering with oo4 <- order(m4[ , 1L], m4[ , 3L]), we order by chromosome name first, then by xxxxxxx (part 3). Note that the ordering of character vectors follows lexicographic order, which is system/region dependent.

Let's compare with this:

idx4 <- which(npart == 4L)
m4 <- matrix(unlist(tmp[idx4]), ncol=4L, byrow=TRUE)
m41 <- match(m4[ , 1L], ASSEMBLED_MOLECULES)
stopifnot(!anyNA(m41))
stopifnot(all(m4[ , 2L] == "NW"))
stopifnot(all(m4[ , 4L] == "random"))
oo4 <- order(m41, m4[ , 3L])
idx4 <- idx4[oo4]

The difference here is that instead of using m4[ , 1L] in the call to order(), now we use m41, which was obtained with m41 <- match(m4[ , 1L], ASSEMBLED_MOLECULES). Note that m41 is an integer vector parallel to m4[ , 1L], but instead of containing chromosome names, it contains chromosome ranks, that is, the place of the chromosome name in the ASSEMBLED_MOLECULES vector. So now when we compute the ordering with oo4 <- order(m41, m4[ , 3L]), we order by chromosome rank first, then by xxxxxxx (part 3).
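
To make the notion of "chromosome rank" concrete, here is a small sketch using the felCat9 ASSEMBLED_MOLECULES vector from the registration file above and a few made-up values for part 1 of the seqlevels (`part1` is a hypothetical stand-in for m4[ , 1L]).

```r
ASSEMBLED_MOLECULES <- paste0("chr",
    c("A1", "A2", "A3", "B1", "B2", "B3", "B4", "C1", "C2",
      "D1", "D2", "D3", "D4", "E1", "E2", "E3", "F1", "F2", "X", "M"))

## Hypothetical stand-in for m4[ , 1L] (part 1 of some unlocalized scaffolds):
part1 <- c("chrX", "chrA1", "chrF2", "chrX")

## Rank of each chromosome name, i.e. its position in ASSEMBLED_MOLECULES:
m41 <- match(part1, ASSEMBLED_MOLECULES)
m41
# [1] 19  1 18 19

## A name that is not an assembled molecule gets NA, which is what the
## stopifnot(!anyNA(m41)) sanity check is there to catch:
match("chrZ", ASSEMBLED_MOLECULES)
# [1] NA
```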

To summarize:

## Order by chromosome _name_ first:
oo4 <- order(m4[ , 1L], m4[ , 3L])

## Order by chromosome _rank_ first:
oo4 <- order(m41, m4[ , 3L])

The latter is what we really want.

For the felCat9 genome, order(m4[ , 1L], m4[ , 3L]) produces the same result as order(m41, m4[ , 3L]) only because the chromosome names in ASSEMBLED_MOLECULES happen to be in lexicographic order already. Even that is not completely true: the last 2 names in ASSEMBLED_MOLECULES are chrX and chrM, so not in lexicographic order, but it doesn't make a difference because no unlocalized scaffold (<chromosome-name>_NW_xxxxxxx_random) is associated with chrM.

In other words, the reason order(m4[ , 1L], m4[ , 3L]) produces the same result as order(m41, m4[ , 3L]) for felCat9 is because, for this genome, the subset of chromosome names in ASSEMBLED_MOLECULES that are represented in m4[ , 1L] is in lexicographic order. We don't want to rely on that kind of luck!
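
The difference becomes visible as soon as the chromosome names are not in lexicographic order. The sketch below uses a made-up genome with numbered chromosomes and invented scaffold identifiers (nothing here comes from felCat9). It passes method="radix" to order() so that character keys are compared in the C locale and the result does not depend on the system settings.

```r
## Hypothetical genome with numbered chromosomes (not felCat9):
ASSEMBLED_MOLECULES <- paste0("chr", c(1:10, "X", "M"))

## Made-up part 1 and part 3 of four unlocalized scaffold names:
part1 <- c("chr10", "chr2", "chr1", "chr2")
part3 <- c("000004v1", "000003v1", "000002v1", "000001v1")

## Order by chromosome _name_ first: "chr10" sorts before "chr2".
oo_name <- order(part1, part3, method="radix")
part1[oo_name]
# [1] "chr1"  "chr10" "chr2"  "chr2"

## Order by chromosome _rank_ first: chr2 comes before chr10, as wanted.
rank1 <- match(part1, ASSEMBLED_MOLECULES)
stopifnot(!anyNA(rank1))
oo_rank <- order(rank1, part3, method="radix")
part1[oo_rank]
# [1] "chr1"  "chr2"  "chr2"  "chr10"
```

In both cases the two chr2 scaffolds end up sorted by part 3 (000001v1 before 000003v1); only the placement of chr10 changes.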

> My current issue is that R CMD build GenomeInfoDb has simply failed to work as expected.

Please add felCat9.R to your fork (with git add felCat9.R), then commit and push. Let me know when you're done, and I'll take a look at your fork to try and figure out what's going on.

kakopo commented 2 years ago

@hpages when you mentioned the fork, I realized the issue might be with it and not the package. I deleted everything and started from scratch, and it worked! I've been able to run R CMD build GenomeInfoDb successfully, thank you a tonne! I've installed the necessary packages after running R CMD build GenomeInfoDb --no-build-vignettes the first time, but running R CMD build GenomeInfoDb the second time brings the following output:

* checking for file ‘GenomeInfoDb/DESCRIPTION’ ... OK
* preparing ‘GenomeInfoDb’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘Accept-organism-for-GenomeInfoDb.Rnw’ using knitr
Loading required package: BiocGenerics

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    Filter, Find, Map, Position, Reduce, anyDuplicated, aperm,
    append, as.data.frame, basename, cbind, colnames, dirname,
    do.call, duplicated, eval, evalq, get, grep, grepl, intersect,
    is.unsorted, lapply, mapply, match, mget, order, paste, pmax,
    pmax.int, pmin, pmin.int, rank, rbind, rownames, sapply,
    setdiff, sort, table, tapply, union, unique, unsplit, which.max,
    which.min

Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'

The following objects are masked from 'package:base':

    I, expand.grid, unname

Loading required package: IRanges
Warning in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  :
  texi2dvi script/program not available, using emulation
Error: processing vignette 'Accept-organism-for-GenomeInfoDb.Rnw' failed with diagnostics:
unable to run pdflatex on 'Accept-organism-for-GenomeInfoDb.tex'
LaTeX errors:
! LaTeX Error: File `beramono.sty' not found.

Type X to quit or <RETURN> to proceed,
or enter new name. (Default extension: sty)

! Emergency stop.
<read *> 

l.87 \RequirePackage
                    [T1]{fontenc}^^M
!  ==> Fatal error occurred, no output PDF file produced!
--- failed re-building ‘Accept-organism-for-GenomeInfoDb.Rnw’

--- re-building ‘GenomeInfoDb.Rnw’ using knitr
Loading required package: GenomicFeatures
Loading required package: GenomicRanges
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Warning in texi2dvi(file = file, pdf = TRUE, clean = clean, quiet = quiet,  :
  texi2dvi script/program not available, using emulation
Error: processing vignette 'GenomeInfoDb.Rnw' failed with diagnostics:
unable to run pdflatex on 'GenomeInfoDb.tex'
LaTeX errors:
! LaTeX Error: File `beramono.sty' not found.

Type X to quit or <RETURN> to proceed,
or enter new name. (Default extension: sty)

! Emergency stop.
<read *> 

l.87 \RequirePackage
                    [T1]{fontenc}^^M
!  ==> Fatal error occurred, no output PDF file produced!
--- failed re-building ‘GenomeInfoDb.Rnw’

SUMMARY: processing the following files failed:
  ‘Accept-organism-for-GenomeInfoDb.Rnw’ ‘GenomeInfoDb.Rnw’

Error: Vignette re-building failed.
Execution halted

The explanation for the relevance of m41 <- match(m4[ , 1L], ASSEMBLED_MOLECULES) is so helpful too! I'd noticed a few discrepancies with the order of the character vectors, but I didn't think too much of them because they'd come and go. It's definitely best not to rely on luck; I'll keep this in mind in the future.

hpages commented 2 years ago

This looks like a TeX/LaTeX issue. TeX/LaTeX is required to build the Sweave vignettes contained in the package. Vignettes are documents located in the vignettes/ folder of an R package. There are 2 types of vignettes: Sweave vignettes (extension .Rnw) and R Markdown vignettes (.Rmd extension). Only Sweave vignettes require TeX/LaTeX.

My understanding is that you are on Linux. If you are on Ubuntu, or another Debian-like system, make sure to install all of the following packages:

texlive
texlive-font-utils
texlive-pstricks
texlive-latex-extra
texlive-fonts-extra
texlive-bibtex-extra
texlive-science
texlive-luatex
texlive-lang-european
texi2html
texinfo
pandoc
pandoc-citeproc
biber

You can install them with sudo apt-get install ...

Then try R CMD build GenomeInfoDb again.

Let me know how that goes.

kakopo commented 2 years ago

@hpages it worked! Running R CMD check GenomeInfoDb_1.33.13.tar.gz brought about this error, similar to an error I'd seen on slack.

Error in BiocGenerics:::testPackage("GenomeInfoDb") : 
    unit tests failed for package GenomeInfoDb

I added --no-tests and everything ran smoothly. Thank you so much for the assistance throughout!

hpages commented 2 years ago

Excellent. That's great news!

Then I guess you're ready to commit, push, and create a PR (Pull Request). Don't forget to add felCat9.R to git (with git add felCat9.R) before you commit. Thanks!

hpages commented 2 years ago

I just merged PR #59. Congratulations @kakopo on your first contribution to Bioconductor! Don't forget to record it on Outreachy at https://www.outreachy.org/outreachy-december-2022-internship-round/communities/bioconductor/refactor-the-bsgenomeforge-tools/contributions/.

The next suggested task for you is #50. Whenever you are ready, go there and ask to be assigned. Thanks!