Variants shouldn't be mapped to reference sequences

ypriverol commented 3 years ago

@EamonnOCearnaigh

As I mentioned in the discussion the other day, it would be great to have logic in the tool for the case where the user sets the mismatch option -mm NUM and the query peptide matches both with 0 mismatches and with 1, 2, etc. mismatches: we need a way to discard the 1, 2, etc. mismatch matches. The use case is:

Peptide query: AAAA -> Protein A contains ..AAAA.. and Protein B contains ..AVAA... We discard the second match because it is far more likely (I would say 100%) that this is just a reference peptide. James mentioned that he does a two-step search: first discard the reference peptides, then search with multiple gaps.

Would be great if we can implement that feature.
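
To make the request concrete, here is a minimal sketch of the filtering rule, not pepgenome's actual code. It assumes mismatches are substitutions only, uses a naive sliding-window search instead of the tool's kmer index, and keeps only the best hit per protein. The names `Match`, `find_matches`, and `drop_mismatched_if_exact` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Match(NamedTuple):
    """One peptide-to-protein hit (hypothetical record, not a pepgenome class)."""
    peptide: str
    protein: str
    mismatches: int


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_matches(peptide, proteins, max_mm):
    """Naive search: best hit per protein with at most max_mm substitutions."""
    hits = []
    for name, seq in proteins.items():
        best = None
        for i in range(len(seq) - len(peptide) + 1):
            mm = hamming(peptide, seq[i:i + len(peptide)])
            if mm <= max_mm and (best is None or mm < best):
                best = mm
        if best is not None:
            hits.append(Match(peptide, name, best))
    return hits


def drop_mismatched_if_exact(matches):
    """If a peptide has any 0-mismatch hit, keep only its 0-mismatch hits."""
    by_peptide = defaultdict(list)
    for m in matches:
        by_peptide[m.peptide].append(m)
    kept = []
    for hits in by_peptide.values():
        if any(h.mismatches == 0 for h in hits):
            hits = [h for h in hits if h.mismatches == 0]
        kept.extend(hits)
    return kept


# The use case from this issue: Protein A has ..AAAA.., Protein B has ..AVAA..
proteins = {"A": "MKAAAAR", "B": "MKAVAAR"}
hits = find_matches("AAAA", proteins, max_mm=1)
filtered = drop_mismatched_if_exact(hits)
print(filtered)
```

With -mm 1 both proteins are hit, but only the exact match to Protein A survives the filter. A peptide that has no exact match anywhere keeps its mismatch hits, so real variant peptides are still reported.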

EamonnOCearnaigh commented 3 years ago

Sounds great - I'll see if I can plan out another edit to implement that this week. Hey also, James passed a Maven command on to me but I am still facing errors when running pepgenome from the terminal. Would you have any ideas?

ypriverol commented 3 years ago

Which error do you get?

ypriverol commented 3 years ago

This command should work:

$ java -jar pepgenome-1.1.1-bin.jar
usage: Arguments: -fasta TRANSL -gtf ANNO -in *.tsv[,*.tsv] [-format OUTF]
                  [-merge TRUE/FALSE] [-source SRC] [-mm NUM] [-mmmode
                  TRUE/FALSE] [-species SPECIES] [-chr 0/1]
 -ann <arg>              Filepath for file containing genome annotation in
                         GTF or GFF3 format
 -chr <arg>              Export chr prefix Allowed 0, 1  (default: 0)
 -exco <arg>             Use exon coordinates rather than CDS (Unannotated
                         peptides)
 -fasta <arg>            Filepath for file containing protein sequences in
                         FASTA format
 -format <arg>           Select the output formats from gtf, gct, bed,
                         ptmbed, all or combinations thereof separated by
                         ',' (default all)
 -genome <arg>           Filepath for file containing genome sequence in
                         FASTA format used to extract chromosome names and
                         order and differenciate between assembly and
                         scaffolds. If not set chromosome and scaffold
                         names and order is extracted from GTF input.
 -gff <arg>              Filepath for file containing genome annotation in
                         GFF3 format
 -gtf <arg>              Filepath for file containing genome annotation in
                         GTF format
 -h                      Print this help & exit
 -in <arg>               Comma(,) separated file paths for files
                         containing peptide identifications (Contents of
                         the file can tab separated format. i.e., File
                         format: four columns: SampleName
                         PeptideSequence
                         PSMs
                         Quant; or mzTab, and mzIdentML)
 -inf <arg>              Format of the input file (mztab, mzid, pavro, or
                         tsv). (default tsv)
 -inm <arg>              Compute the kmer algorithm in memory or using
                         database algorithm (default 0, database 1)
 -merge <arg>            Set 'true' to merge mappings from all files from
                         input (default 'false')
 -mm <arg>               Allowed mismatches (0, 1 or 2; default: 0)
 -mmmode <arg>           Mismatch mode (true or false): if true
                         mismatching with two mismatches will only allow 1
                         mismatch every kmersize (default: 5) positions.
                         (default: false)
 -source <arg>           Please give a source name which will be used in
                         the second column in the output gtf file
                         (default: PoGo)
 -variant_filter <arg>   Peptide filter mode.

EamonnOCearnaigh commented 3 years ago

Hey, sorry for the delay - job hunting at the minute since my placement contract is ending.

I'm having trouble actually generating the JAR. The JAR wasn't working when generated from the IntelliJ Java application configurations that I was using for debugging. So, I switched it to a Maven configuration and I'm trying to get it to build. I tried running "mvn install" on the command line, which downloaded some files but resulted in:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:38 min
[INFO] Finished at: 2021-08-26T03:10:49+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project pepgenome: Could not resolve dependencies for project org.bigbio.pgatk:pepgenome:jar:1.1.beta: Could not find artifact com.sun.java:tools:jar:13.0.2 at specified path C:\Program Files\Java\jdk-13.0.2/../lib/tools.jar

Everything is set to use the same version of java, any idea what's causing this? I didn't get this error during the testing stages at all.

ypriverol commented 3 years ago

I removed that dependency. Can you try now, @EamonnOCearnaigh?

EamonnOCearnaigh commented 3 years ago

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 8.108 s
[INFO] Finished at: 2021-08-26T23:54:46+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project pepgenome: Could not resolve dependencies for project io.github.bigbio:pepgenome:jar:1.1.1: Failed to collect dependencies at uk.ac.ebi.jmzidml:jmzidentml:jar:1.2.11 -> uk.ac.ebi.pride.architectural:pride-xml-handling:pom:1.0.3 -> psidev.psi.tools:xxindex:jar:0.23 -> net.sourceforge.cpdetector:cpdetector:jar:1.0.7 -> net.sourceforge.jargs:jargs:jar:1.0: Failed to read artifact descriptor for net.sourceforge.jargs:jargs:jar:1.0: Could not transfer artifact net.sourceforge.jargs:jargs:pom:1.0 from/to sonatype-release (https://oss.sonatype.org/service/local/staging/deploy/maven2): authentication failed for https://oss.sonatype.org/service/local/staging/deploy/maven2/net/sourceforge/jargs/jargs/1.0/jargs-1.0.pom, status: 401 Unauthorized -> [Help 1]

EamonnOCearnaigh commented 3 years ago

I'm running "mvn install" on the command line. Is that the right command?

ypriverol commented 3 years ago

Yes, mvn install is fine. Which Java version do you have? Most of these issues are related to the Java version you are using.

ypriverol commented 3 years ago

I have added the dependency to our internal maven repo. Can you try again @EamonnOCearnaigh ?

EamonnOCearnaigh commented 3 years ago

It got further this time.

Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/struts/struts-core/1.3.8/struts-core-1.3.8.jar (329 kB at 270 kB/s)
Downloaded from central: https://repo.maven.apache.org/maven2/org/apache/struts/struts-taglib/1.3.8/struts-taglib-1.3.8.jar (252 kB at 205 kB/s)
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:00 min
[INFO] Finished at: 2021-08-27T18:29:02+01:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-javadoc-plugin:3.1.1:jar (attach-javadocs) on project pepgenome: MavenReportException: Error while generating Javadoc: Unable to find javadoc command: The environment variable JAVA_HOME is not correctly set. -> [Help 1]

Also my Java version is:

java version "13.0.2" 2020-01-14
Java(TM) SE Runtime Environment (build 13.0.2+8)
Java HotSpot(TM) 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)

ypriverol commented 3 years ago

Java doesn't find the javadoc command. I have no idea why. Can you check on Stack Overflow?
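
The Maven error quoted earlier in this thread names JAVA_HOME as the problem, so one quick thing to rule out is whether that variable points at a JDK at all. Below is a minimal diagnostic sketch, assuming the standard JDK layout where the javadoc executable sits in the bin directory under JAVA_HOME. The helper name `find_javadoc` is hypothetical and this is only a local check, not a confirmed fix for this build.

```python
import os
from pathlib import Path


def find_javadoc(env=None):
    """Return the javadoc path under JAVA_HOME, or None if it is missing."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None  # JAVA_HOME is unset or empty
    bin_dir = Path(java_home) / "bin"
    # Check both the Unix and the Windows executable names.
    for name in ("javadoc", "javadoc.exe"):
        candidate = bin_dir / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_javadoc()
    if found:
        print(f"javadoc found at {found}")
    else:
        print("JAVA_HOME is unset or does not point to a JDK with javadoc")
```

If this reports that javadoc is missing, JAVA_HOME is either unset or points somewhere other than the JDK root (for example a JRE or a bin subfolder).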

EamonnOCearnaigh commented 3 years ago

Will do. For now, do you have a working JAR you could post to GitHub or send by email? We just need to place it into a Nextflow pipeline.

bigbio / pepgenome

Variants shouldn't be mapped to reference sequences #1