(inactive) geno2proteo - Githubissues

yaoyong02 commented 6 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Repository: https://github.com/yaoyong02/geno2proteo

Confirm the following by editing each check box to '[x]'

[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.

I am familiar with the essential aspects of Bioconductor software management, including:

[x] The 'devel' branch for new packages and features.
[x] The stable 'release' branch, made available every six months, for bug fixes.
[x] Bioconductor version control using Git (optionally via GitHub).

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 6 years ago

Hi @yaoyong02

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: geno2proteo
Type: Package
Title: Finding the DNA and protein sequences in a list of genomic or proteomic loci
Version:0.99.0 
Date: 2017-01-24
Author: Yaoyong Li 
Maintainer: Yaoyong Li<liyaoyong85@gmail.com>
biocViews: Genetics, Proteomics, Sequencing, Annotation, GenomeAnnotation
Description: Using the DNA sequence and gene annotation files provided in ENSEMBL web site, 
   the functions implemented in the package try to find the DNA sequences and 
   protein sequences of a list of genomic loci, and to find the genomic coordinates 
   and protein sequences of a list of protein locations, which are the frequent
   tasks in the analysis of genomic and proteomic data.
License: Artistic-2.0
Depends: R (>= 2.3.0)
Imports: S4Vectors, BiocGenerics, GenomicRanges, IRanges, R.utils, RUnit, utils
SystemRequirements: Perl (>= 2.0.0)
LazyLoad: yes

mtmorgan commented 6 years ago

The vignette must contain code that is evaluated when processed (in <<>>= chunks, with eval=TRUE). The code should illustrate the overall function of the package.

yaoyong02 commented 6 years ago

ok I will try to add some codes to the vignette to demonstrate the overall function of the package.

yaoyong02 commented 6 years ago

I have added some codes with eval=TRUE into the vignette to illustrate the function of the package. I have just checked in these changes into my github repository. Please take a look at it.

bioc-issue-bot commented 6 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171005110422.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

583fd0b Bump the version from 0.99.0 to 0.99.1 to trigger ...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171006074003.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

15cd408 Bump the version to 0.99.2 to triger a new build

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171006092913.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

2f1b1ad Bumped the version to 0.99.3 to trigger a build

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171016062228.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

6940eb3 bumped the version number to 0.99.4 to triger a bu...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171023062055.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

4a73b5c Bumped the version to 0.99.5 to triger another bui...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171023065427.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

cb73309 Bumped the version number to 0.99.6 to trigger a n...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171024055040.html

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

aac72f9 bumped the version number to 0.99.7 for triggering...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/geno2proteo_buildreport_20171027064622.html

yaoyong02 commented 6 years ago

The package has been built without any error. It also passed both the R check and BiocCheck without any error, warning and note message. Therefore I think that the package is now ready for a close look by the reviewer.

mtmorgan commented 6 years ago

The EnsemblDb package seems to already provide a lot of the functionality of this package. For instance

library(EnsemblDb)
library(AnnotationHub)
## retreive the EnsDb data base; expensive the first time but not subsequently
ensdb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]

## get proteins (and other columns) for chr 16
prot <- proteins(edb, filter=SeqNameFilter("16"))
prot$protein_sequence <- Biostrings::AAStringSet(prot$protein_sequence)

leads to

> prot
DataFrame with 5003 rows and 4 columns
               tx_id      protein_id        protein_sequence    seq_name
         <character>     <character>           <AAStringSet> <character>
1    ENST00000304733 ENSP00000306407 MGETWKNICS...FLTVERPQED          16
2    ENST00000002501 ENSP00000002501 MEPPEGAGTG...FLTVERPQED          16
3    ENST00000392973 ENSP00000376699 MWEPCGSQGD...FLTVERPQED          16
4    ENST00000568838 ENSP00000457625 MTQQWREVNE...FLTVERPQED          16
5    ENST00000568330 ENSP00000456573 XPGRSVSLSL...QPPGLKRAMP          16
...              ...             ...                     ...         ...
4999       LRG_541t1       LRG_541p1 MALRVVRSVR...YEKYIKSIEE          16
5000       LRG_655t1       LRG_655p1 MASNDYTQQA...HRQDRRERPY          16
5001       LRG_674t1       LRG_674p1 MSRRKQAKPQ...VEDSKEIVTS          16
5002       LRG_689t1       LRG_689p1 MGGCAGSRRR...PLHDFLCQLS          16
5003       LRG_689t2       LRG_689p2 MGGCAGSRRR...PLHDFLCQLS          16

Some additional functionality is also readily available using existing resources, e.g.,

> query(AnnotationHub(), c("release-90", "Homo sapiens"))
snapshotDate(): 2017-10-27
AnnotationHub with 9 records
# snapshotDate(): 2017-10-27 
# $dataprovider: Ensembl
# $species: Homo sapiens
# $rdataclass: TwoBitFile, GRanges
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH57147"]]' 

            title                                           
  AH57147 | Homo_sapiens.GRCh38.90.abinitio.gtf             
  AH57148 | Homo_sapiens.GRCh38.90.chr.gtf                  
  AH57149 | Homo_sapiens.GRCh38.90.chr_patch_hapl_scaff.gtf 
  AH57150 | Homo_sapiens.GRCh38.90.gtf                      
  AH57450 | Homo_sapiens.GRCh38.cdna.all.2bit               
  AH57451 | Homo_sapiens.GRCh38.dna.primary_assembly.2bit   
  AH57452 | Homo_sapiens.GRCh38.dna_rm.primary_assembly.2bit
  AH57453 | Homo_sapiens.GRCh38.dna_sm.primary_assembly.2bit
  AH57454 | Homo_sapiens.GRCh38.ncrna.2bit

See also BSgenome::getSeq(), which could be used with the .2bit files and GenomicRanges from the txdb.

Please review your package, assessing whether it provides novel functionality thxat is not otherwise available. If there is novel functionality, please revise your package to use existing Bioconductor resources like those mentioned above, rather than the perl scripts and manual file downloads that you mention.

@jotsetung may be able to help with any specific questions about the ensembldb package and related resources.

jorainer commented 6 years ago

I had a look at the vignette, from the 4 described tasks of the package, 3 are relatively simple to achieve with ensembldb and other Bioconductor packages (i.e. get protein sequences for a genomic region, get DNA sequences, get AA sequences). Mapping of regions/ranges within a protein sequence to the genome has been addressed in the Pbase package but is not ideal yet:

> library(ensembldb)
> library(AnnotationHub)
> edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]]
snapshotDate(): 2017-10-27
loading from cache '/Users/jo//.AnnotationHub/64495'
> library(Pbase)
> ## Fetch the protein ENSP00000269305
> prot <- Proteins(edb, filter = ~ protein_id == "ENSP00000269305")

Now, this Proteins object contains also all protein domains within the protein sequence. These are represented as IRanges with start and end coordinates of the region/range within the protein sequence:

> pcols(prot)$ProteinDomains
IRangesList of length 1
$ENSP00000269305
IRanges object with 12 ranges and 3 metadata columns:
               start       end     width |      protein_id
           <integer> <integer> <integer> |     <character>
   PS00348       237       249        13 | ENSP00000269305
  SSF47719       319       357        39 | ENSP00000269305
  SSF49417        97       287       191 | ENSP00000269305
   PF07710       319       357        39 | ENSP00000269305
   PF00870        95       288       194 | ENSP00000269305
       ...       ...       ...       ... .             ...
   PR00386       116       142        27 | ENSP00000269305
   PR00386       158       179        22 | ENSP00000269305
   PR00386       213       234        22 | ENSP00000269305
   PR00386       264       286        23 | ENSP00000269305
   PR00386       236       258        23 | ENSP00000269305
           protein_domain_source interpro_accession
                     <character>        <character>
   PS00348           scanprosite          IPR002117
  SSF47719           superfamily          IPR010991
  SSF49417           superfamily          IPR008967
   PF07710                  pfam          IPR010991
   PF00870                  pfam          IPR011615
       ...                   ...                ...
   PR00386                prints          IPR002117
   PR00386                prints          IPR002117
   PR00386                prints          IPR002117
   PR00386                prints          IPR002117
   PR00386                prints          IPR002117

And any protein range within the Proteins object can be mapped back to the genome:

> mapToGenome(prot, edb)
Fetch coding region for proteins ... OK
GRangesList object of length 1:
$ENSP00000269305 
GRanges object with 26 ranges and 5 metadata columns:
           seqnames             ranges strand |           tx_id
              <Rle>          <IRanges>  <Rle> |     <character>
  SSF47719       17 [7670638, 7670715]      - | ENST00000269305
   PF07710       17 [7670638, 7670715]      - | ENST00000269305
   PR00386       17 [7670659, 7670715]      - | ENST00000269305
   PR00386       17 [7673535, 7673552]      - | ENST00000269305
  SSF47719       17 [7673535, 7673573]      - | ENST00000269305
       ...      ...                ...    ... .             ...
   PR00386       17 [7675994, 7676023]      - | ENST00000269305
  SSF49417       17 [7675994, 7676080]      - | ENST00000269305
   PF00870       17 [7675994, 7676086]      - | ENST00000269305
   PF08563       17 [7676391, 7676403]      - | ENST00000269305
   PF08563       17 [7676521, 7676582]      - | ENST00000269305
                                                                                                                                                                                                       pepseq
                                                                                                                                                                                                  <character>
  SSF47719                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
   PF07710                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
   PR00386                                                                                                                                                                          EYFTLQIRGRERFEMFRELNEALEL
   PR00386                                                                                                                                                                          EYFTLQIRGRERFEMFRELNEALEL
  SSF47719                                                                                                                                                            KKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGK
       ...                                                                                                                                                                                                ...
   PR00386                                                                                                                                                                        SGTAKSVTCTYSPALNKMFCQLAKTCP
  SSF49417    VPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEE
   PF00870 SSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEEN
   PF08563                                                                                                                                                                          QSDPSVEPPLSQETFSDLWKLLPEN
   PF08563                                                                                                                                                                          QSDPSVEPPLSQETFSDLWKLLPEN
                 accession exonJunctions     group
               <character>     <logical> <integer>
  SSF47719 ENSP00000269305          TRUE         2
   PF07710 ENSP00000269305          TRUE         4
   PR00386 ENSP00000269305          TRUE         7
   PR00386 ENSP00000269305          TRUE         7
  SSF47719 ENSP00000269305          TRUE         2
       ...             ...           ...       ...
   PR00386 ENSP00000269305          TRUE         8
  SSF49417 ENSP00000269305          TRUE         3
   PF00870 ENSP00000269305          TRUE         5
   PF08563 ENSP00000269305          TRUE         6
   PF08563 ENSP00000269305          TRUE         6

-------
seqinfo: 1 sequence from GRCh38 genome

Somehow this failed for the region you defined in your vignette (https://github.com/ComputationalProteomicsUnit/Pbase/issues/46) - I have to check this.

If you like you are welcome to discuss or add functionality to ensembldb: https://github.com/jotsetung/ensembldb (or Pbase https://github.com/ComputationalProteomicsUnit/Pbase).

yaoyong02 commented 6 years ago

Dear Martin,

First I am sorry for the late reply because I has been very busy last week.

Before I submitted the package to Bioconductor repository, I have asked an experienced colleague to have a look at this package. Interestingly, the initial feedback that I got from the colleague was almost the same as the comments you made, which has essentially two parts, namely that he couldn't see the novelty in the package and referred me to an existing Bioconductor package BSgenome, and that I should avoid using perl scripts and instead use the existing package such as BSgenome for getting the genomic data. After some explanations from me, my colleague was convinced that there are two novel tasks in this package, namely,

Given a set of any genomic regions (specified by their coordinates), extract the amino-acid sequence from overlapping coding exons.
Given a set of any amino-acids (assumed from known proteins), extract the original genome co-ordinates.

Regarding the novelty of this package, you mentioned the existing packages EnsemblDb and BSgenome, and @jotsetung also mentioned another package Pbase. As you mentioned, I believe that BSgenome::getSeq() can get DNA sequence of any specified genomic region. However, I think that two tasks listed above that my package can perform are the novelty of this package. I think that the keyword here is "any". As both you and @jotsetung illustrated, in their current implementation, by using EnsemblDb and Pbase you can get the protein sequences of any protein or any domain which exist in the ENSEMBL database, but you cannot get the protein sequences of any genomic region or any protein region specified by their coordinates, which has been demonstrated very well in the @jotsetung's experiment, as explained in the following.

As you can see in @jotsetung's experiment in https://github.com/ComputationalProteomicsUnit/Pbase/issues/46, by using EnsemblDb and Pbase, @jotsetung succeeded in obtaining the protein sequence of a domain by irl <- IRangesList(ENSP00000269305 = IRanges(start = 237, end = 249)), but failed to obtain the protein sequence of a protein region which can be seen as an arbitrary region, as defined by irl <- IRangesList(ENSP00000269305 = IRanges(start = 281, end = 391)). As @jotsetung noted, my package can obtain the protein sequence of both regions, exactly because my package can retrieve the protein sequences of any protein region or any genomic region. In another word, if you specify a protein by their ENSEMBL id, and any region in this protein by specifying the start and end coordinates, e.g. ENSEMBL_ID x1 x2 where x1 and x2 can be any number from 1 to the length of the protein as long as x1 <= x2, this package can obtain the protein sequence of the region, but EnsemblDb and Pbase can obtain the protein sequence only for the values of x1 and x2 corresponding to the one of the domains defined in the proteins or the whole protein. The same is true for retrieving the protein sequences from any genomic regions. I hope that you can see from this example what I meant by 'any' in the two novel tasks that my package can perform and the novelty of this package.

In summary on the novelty, among the four tasks that my package can perform, namely,

Given a list of any genomic regions specified by their coordinates on genome, how can the protein sequences be obtained that are coded by these regions?
Given a list of any protein regions specified by their coordinates on the protein, how can the genomic coordinates of those protein regions be obtained from the corresponding genome.
Given a list of any genomic regions, how can the DNA sequences be obtained for these regions?
Given a list of any protein regions, how can the amino acid sequences of these protein regions be obtained.

the task 3 can be done by BSgenome::getSeq(), the task 4 is simple if you already know the protein sequence of the related protein (which this package is not assumed), but I am not aware of any existing Bioconductor package or other software that can perform the tasks 1 and/or 2, which involve the mapping between a transcript's genomic structure (i.e. structure of its coding exons) and the corresponding protein sequences of those coding exons.

The purpose of the package is to provide four functions to perform four important tasks in bioinformatics, two of which are novel. The philosophy of the package is to be simple and straightforward. Regarding your comments on avoiding using perl scripts, firstly noted that some other Bioconductor packages have used perl scripts, such as EnsemblDb that you mentioned in your comments. Secondly noted that different computer languages have strength and weakness in implementing different tasks, and I used perl scripts because they are much more efficient than R in performing some functions that were needed in this package, just as some of other R packages did.

Regarding the manually downloading of the data files, as you can see in the documentation, actually one idea of the package is that a user can create and use his own, new data files containing the DNA sequences and transcript annotations of his choice, instead of the existing ones in e.g. the ENSEMBL database, which may be needed in some research. On the other hand, I can imagine that in most cases the user just needs the existing data files. People may have different opinions about which of the following is easier, manually downloading the data files from ENSEMBL web site for which the documentation has given the clear instructions about how to do it, or using the existing R packages such as EnsemblDb, AnnotationHub and BSgenome which presumably download and maybe encapsulate the data downloaded from the database web sites like ENSEMBL. Here is my personal experience of using EnsemblDb and AnnotationHub. Earlier this week in order to run the codes in your comments, I installed the two packages EnsemblDb and AnnotationHub in my linux computer without any problem, however, after loaded the two packages into my R session, I tried to run the first command in both your and @jotsetung's comments and got some error message, as in the following

library(ensembldb) library(AnnotationHub) edb <- query(AnnotationHub(), c("Ensembl 90 EnsDb", "Homo sapiens"))[[1]] snapshotDate(): 2017-10-27 Error in .AnnotationHub_get1(x[i]) : no records found for the given index In addition: Warning message: In rsqlite_fetch(res@ptr, n = n) : Don't need to call dbFetch() for statements, only for queries

I also tried in my laptop using Windows and a different R version and got the same error. This is one of the reasons that I prefer manually downloading the data files directly from the well-maintained ENSEMBL web site than using the packages which I will have to spend time to install, understand and use and don't always work.

Frankly, I have been installing and using many R packages including many in the Bioconductor repository to do the bioinformatic analysis works on our projects. For this I am grateful to the R repositories including Bioconductor. I have used the functions implemented in this package in several projects. I would like to share those functions with others, which is why I submitted the package to Bioconductor. I hope that you can see the novelty of this package from my above explanations. I wish to keep the manually downloading/creating part of this package, because I regard it as an advantage rather than disadvantage. Please let me know what you think.

Best regards,

Yaoyong

yaoyong02 commented 6 years ago

In terms of novelty, I think that a better key word here is 'anywhere'. Given any region or anywhere in the genome, represented by (Chr_name, start_coordinate, end_coordinate) where Chr_name is any chromosome in the genome and start_coordinate and end_coordinate can be absolutely any two positions in the chromosome as long as start_coordinate <= end_coordinate, the package can obtain all the protein sequences coded within the region if there is any. Similarly, given any region or anywhere in a protein sequence, represented by (protein_ENSEMBL_ID, start_coordinate, end_coordinate) where start_coordinate and end_coordinate are any two positions on the amino acid sequence of the protein, the package can get the genomic region from which the protein region specified by (protein_ENSEMBL_ID, start_coordinate, end_coordinate) was translated.

In contrast, as I understand and as @jotsetung has demonstrated in his experiment described in ComputationalProteomicsUnit/Pbase#46, the existing packages such as BSgenome, ensembldb and Pbase cannot perform the two tasks, namely obtaining the protein sequences coded within any genomic region and the genomic region corresponding to any protein regions. Instead, they can get the genomic regions corresponding to some pre-defined protein regions such as domains and the protein sequences of those pre-defined protein regions such as domains.

jorainer commented 6 years ago

Thanks Yaoyong for your explanations. Some comments from my side:

but failed to obtain the protein sequence of a protein region which can be seen as an arbitrary region Pbase can map arbitrary regions within a protein's sequence to the genome. It failed on this one and I am presently evaluating why this is the case (looks like a simple bug) and am porting the functionality into ensembldb.

but EnsemblDb and Pbase can obtain the protein sequence only for the values of x1 and x2 corresponding to the one of the domains defined in the proteins or the whole protein.

That's not correct. EnsDb databases contain the genomic coordinates of all transcript's exons, the CDS, the mapping between transcripts and proteins and the protein sequence. Thus it is possible to map any region within a protein sequence to the genome.

Regarding your comments on avoiding using perl scripts, firstly noted that some other Bioconductor packages have used perl scripts, such as EnsemblDb that you mentioned in your comments

Yes, ensembldb uses perl script, but only to generate EnsDb databases because I am using the Ensembl perl API to access and retrieve data from Ensembl's databases. The perl scripts are not used in the day-to-day use of ensembldb. If it wasn't for the Ensembl perl API I would not use perl scripts within an R package, because it can not be expected from any user to have perl installed on a system.

Regarding the problem you experienced with ensembldb and AnnotationHub: which version of R and Bioconductor did you use? With R version 3.4.2 and the related Bioconductor version 3.6 it works. Ensembl 90 EnsDb are not available for earlier versions of Bioconductor.

Regarding the manual download of files: in terms of reproducible research it is beneficial to retrieve data from versioned resources (AnnotationHub allows to retrieve such versioned databases (e.g. EnsDb databases, from a central repository). Based on my experience, relying on manually downloaded files can be problematic here, especially if analyses are shared within your work group or between work groups.

jo

mtmorgan commented 6 years ago

I'll try to get a more complete response later in the week, but want to say that the innovations you provide are both useful and appreciated. The primary concern is how these are exposed to the user, with a real avenue being additions or enhancements to existing packages that leverage current infrastructure, rather than addition of a new package that does not offer synergy with existing resources. From working with Johannes, I know that he's incredibly competent and responsive, and am sure he'd be pleased to make room for the novel components of your package as new code, vignettes, etc. The perl code is certainly an issue from the perspective of usability and project maintenance. Again I appreciate the novelty of the features you're introducing.

jorainer commented 6 years ago

As Martin mentioned, you're welcome to contribute code to https://github.com/jotsetung/ensembldb/. I started moving mapping code from Pbase to ensembldb (since it makes more sense there - Pbase is more related to working with proteins and peptides within protein sequences) in the map_to_genome branch (see the R/proteinToGenome.R file there).

yaoyong02 commented 6 years ago

Dear Martin,

many thanks for recognising the novelty of this package and suggesting an alternative way to share the novel functions. I have just spoken with our group leader Prof. Andy Sharrocks about it. We are thinking about it and will be back to you soon.

Best regards,

yaoyong

yaoyong02 commented 6 years ago

Dear Martin,

I have been thinking about your suggestion and discussed with Prof. Sharrocks as well. To be honest I didn't expect the suggestion when I started submitting the package to the Bioconductor repository. So it took us a while to think which is the best option. The option that I prefer is to keep the novel functions in the current package. The idea behind creating the R package was to create a simple package which focuses on one thing, namely obtaining the DNA sequences and protein sequences of any genomic or protein regions, amounting to four operations and corresponding to the four functions implemented in the package. I still think it's a good idea. My opinion is that from user's point of view, in many cases a small package focusing on one thing is better than a big package doing different things.

Regarding your other two concerns, namely using perl scripts and manually downloading data files, as you know, some other Bioc packages have also used perl scripts. In addition, my perl scripts do not need any specific operation or module so that any version of perl can be used to run the scripts. I believe that the vignettes has included the clear instructions about how to obtain those data files and the user should be able to locate and download the data files without any difficulty.

In summary, our package is bioinformatically relevant, contains some novel functions, and has passed the R and Bioc checks. I hope that the package can be included in Bioconductor repository so that it can be used by the wide users.

Best regards,

Yaoyong

mtmorgan commented 6 years ago

Is there a reason the perl scripts cannot be written in R, using the well-developed and computationally efficient infrastructure that the project already provides?

yaoyong02 commented 6 years ago

Dear Johannes,

I am sorry if I misinterpreted your experiment results in ComputationalProteomicsUnit/Pbase#46. I agree that if you already know or are given the amino acid sequence of a protein, it's almost a trial task to get any part of it by simply using a substring function. My package does not rely on the known protein sequences which I understand is a part of ENSEMBL annotations. Instead, given a protein coordinates, my package first finds the corresponding genomic region and then translates the coding structures within the genomic region into the amino sequence. Its implementation is quite simple -- basically it's a combination of two other functions in the package.

Best regards,

Yaoyong

yaoyong02 commented 6 years ago

I can write some R codes to replace the perl scripts, But as I said earlier, I think that the perl scripts are much more efficient than R for the tasks.

yaoyong02 commented 6 years ago

Dear Martin,

I have noted from the Bioc package guideline that the submitted package should make use of appropriate existing packages and classes. So I undertand where your suggestions about using the existing packages were from. Actually this package has used the Bioc package GenomicRanges from which this package has greatly benefitted. On the other hand, my opinion is that it is not necessary for this package to use other existing packages. Our purpose is to create a simple package with the straightforward functions. I think that using other existing packages will add one more layer to this package, making it more complicated, which is not consistent with our purpose. Moreover, I cannot see any great benefit that this package can get from using other packages. Therefore I would like not to use other existing packages.

As you may see, this package addresses some basic bioinformatic questions, contains some novel features, and is working well. I would like the package to be included in the Bioc repository so that it can be available to the wider user. On the other hand, I well understand that it's up to the reviewer to decide if it is accepted into the Bioc repository. So please let me know what you think.

Best regards,

Yaoyong

mtmorgan commented 6 years ago

The package won't be accepted with the perl script. I'm closing the issue, but feel free to add a comment if you change your mind.

Bioconductor / Contributions

(inactive) geno2proteo #508