erhard-lab / price

Improved Ribo-seq enables identification of cryptic translation events
11 stars 0 forks source link

Using ViewCIT #4

Closed TamaraO closed 6 years ago

TamaraO commented 6 years ago

Hi,

I am trying to use ViewCIT to extract various features from the .cit file. For example, I would like to get a bed file for all the ORFs. I used this command: gedi -e ViewCIT -m Bed -name 'd.getType()' Sample.orfs.cit > Sample.bed

It generated a bed file, but there are a lot fewer lines in the bed file than the tsv file:

wc -l Sample.orfs.tsv
49405 Sample.orfs.tsv

wc -l Sample.bed
14648 Sample.bed

Why is that? How do I get a bed file for all ORFs?

Then, I try to use another example from your Wiki:

/Price$ gedi -e ViewCIT -q 1+:7844747-7844783 -m bed -name 'd.getOrfType()' -score 'd.getActivities()[0]' Sample.orfs.cit
2018-03-21 13:51:10.902 INFO Command: gedi -e ViewCIT -q 1+:7844747-7844783 -m bed -name d.getOrfType() -score d.getActivities()[0] Sample.orfs.cit
2018-03-21 13:51:12.371 INFO Discovering classes in classpath
2018-03-21 13:51:12.754 INFO Preparing simple class references
2018-03-21 13:51:12.996 INFO Gedi 1.0.2 (JAR) startup
<eval>:1 TypeError: d.getOrfType is not a function
    at jdk.nashorn.internal.runtime.ECMAErrors.error(ECMAErrors.java:57)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:213)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:185)
    at jdk.nashorn.internal.runtime.ECMAErrors.typeError(ECMAErrors.java:172)
    at jdk.nashorn.internal.runtime.Undefined.lookup(Undefined.java:102)
    at jdk.nashorn.internal.runtime.linker.NashornLinker.getGuardedInvocation(NashornLinker.java:106)
    at jdk.nashorn.internal.runtime.linker.NashornLinker.getGuardedInvocation(NashornLinker.java:98)
    at jdk.internal.dynalink.support.CompositeTypeBasedGuardingDynamicLinker.getGuardedInvocation(CompositeTypeBasedGuardingDynamicLinker.java:176)
    at jdk.internal.dynalink.support.CompositeGuardingDynamicLinker.getGuardedInvocation(CompositeGuardingDynamicLinker.java:124)
    at jdk.internal.dynalink.support.LinkerServicesImpl.getGuardedInvocation(LinkerServicesImpl.java:154)
    at jdk.internal.dynalink.DynamicLinker.relink(DynamicLinker.java:253)
    at jdk.nashorn.internal.scripts.Script$Recompilation$12$20AAA$\^eval\_.L:1(<eval>:1)
    at jdk.nashorn.internal.runtime.ScriptFunctionData.invoke(ScriptFunctionData.java:663)
    at jdk.nashorn.internal.runtime.ScriptFunction.invoke(ScriptFunction.java:494)
    at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:393)
    at jdk.nashorn.api.scripting.ScriptObjectMirror.call(ScriptObjectMirror.java:117)
    at gedi.util.nashorn.JSTriFunction.apply(JSTriFunction.java:52)
    at executables.ViewCIT.lambda$start$9(ViewCIT.java:244)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at executables.ViewCIT.start(ViewCIT.java:389)
    at executables.ViewCIT.main(ViewCIT.java:56)

Could you, please, provide a more detailed explanation of how to use ViewCIT? In particular, how to use -name 'd.getTranscriptId()'? What else can I write instead of getTranscriptId?

In particular, I would like to get a BED file for all ORFs, but in the Name column, I would like to have the ORF Id instead of the ORF type. How can I do that?

Thank you!

florianerhard commented 6 years ago

Dear Tamara,

the orfs.cit file contains the ORFs after filtering by FDR, the orfs.tsv file contains all ORFs (before filtering). I corrected the documentation on this. Do you really need a bed file for the unfiltered ORFs?

Sorry, the documentation on that was outdated, I corrected it. It's getType instead of getOrfType, and getTotalActivity() instead of d.getActivities()[0].

You can use getTranscriptId on the .index.cit file created by IndexGenome. What is interesting to you is what you can do with your ORFs, isn't it? To get the ORF id as in the tsv file, you can go for

gedi -e ViewCIT -m bed -q '8+:27491575-27528868' -name 'd.transcript+"_"+d.type+"_"+d.orfid' -score '-Math.log10(d.combinedP)' price/hsv.orfs.cit

This also gives you the (uncorrected) p value (-log10). (rounded, as bed needs an integer field there)

I think I am going to expand the documentation on this in the next few days...

Best regards, Florian

TamaraO commented 6 years ago

Thank you for the detailed response. What is used for FDR cutoff? I am worried that if it is too stringent, it might remove some ORFs that are borderline "real". I was planning to do the filtering myself.

TamaraO commented 6 years ago

It is still giving me an error when trying to get ORFid like you recommended:

price$ gedi -e ViewCIT -m bed -q -name 'd.transcript+"_"+d.type+"_"+d.orfid' -score '-Math.log10(d.combinedP)' Sample.orfs.cit
2018-03-21 15:54:59.298 INFO Command: gedi -e ViewCIT -m bed -q -name d.transcript+"_"+d.type+"_"+d.orfid -score -Math.log10(d.combinedP) Sample.orfs.cit
2018-03-21 15:55:01.357 INFO Discovering classes in classpath
2018-03-21 15:55:01.767 INFO Preparing simple class references
2018-03-21 15:55:02.032 INFO Gedi 1.0.2 (JAR) startup

File d.transcript+"_"+d.type+"_"+d.orfid does not exist!

ViewCIT <Options> <file>

Options:
 -l         Only list the number of elements per reference
 -q <location>      Only output elements overlapping the given query (i.e. whole reference or genomic region)
 -o <file>      Specify output file (Default: stdout)
 -m <mode>      output mode: Bed/Location/Cit/Genepred/Gtf (Default: location)
 -t <js/name>       name of a transformer or javascript function body transform reference genomic region (variable d is current data, ref the reference, reg the region; either return a data object or a reference genomic region object!)
 -name <js>     javascript function body to generate name (variable d is current data, ref the reference, reg the region)
 -score <js>        javascript function body to generate score (variable d is current data, ref the reference, reg the region)
 -filter <js>\t javascript function body returning true for entries to use  (variable d is current data, ref the reference, reg the region)

 -p         Show progress
 -h         Show this message
 -D         Output debugging information

Still not sure how to extract the ORF id into the BED file...

florianerhard commented 6 years ago

You forgot to remove the -q when calling ViewCIT! FDR cutoff can be specified when calling price with -fdr 0.1 (10% is the default)

Florian

TamaraO commented 6 years ago

Perfect!! It's working now! Thank you!

TamaraO commented 6 years ago

Hi Florian,

In the resulting bed file, the ORFs with one exon do not have values in columns 7-12:

1  567020    567161    ENST00000414273.1_1_ncRNA_0    3           +
1  763050    763155    ENST00000608189.4_1_ncRNA_0    1           +
1  2487899   2488094   ENST00000355716.4_1_uORF_3     3           +
1  2488103   2494712   ENST00000355716.4_1_CDS_0      7           +  0  0  0  8   69,109,126,156,91,143,32,126,                          0,1061,1678,3158,3959,5008,6200,6483,
1  2494853   2494946   ENST00000463471.6_1_ncRNA_12   1           +
1  3541576   3545167   ENST00000378344.6_1_Variant_0  12          +  0  0  0  5   261,92,177,154,195,                                    0,411,700,2487,3396,
1  2323330   2328573   ENST00000378513.7_1_uoORF_3    5           +  0  0  0  3   67,88,19,                                              0,3892,5224,
1  2327229   2334563   ENST00000605895.5_3_CDS_0      2147483647  +  0  0  0  6   81,105,100,79,136,90,                                  0,1325,3624,5066,6416,7244,

This causes tools like BEDTools to crash. It would be great if the bed file was consistent.

Thank you!

Tamara

florianerhard commented 6 years ago

I fixed ViewCIT. If you pull the most recent version from github (and use buildworkspace.bash to build it), BETTools should be able to handle the files! Best, Florian

TamaraO commented 6 years ago

Hi Florian,

Regarding the FDR and filtering for the final .cit file. If I set -fdr 1 flag, I still don't get all of the contents of .tsv in the .cit file, and many ORFs are filtered out. Why is that? How can I get all (or most) of them?

Thank you!

Tamara

florianerhard commented 6 years ago

Hi Tamara,

I guess that many of the ORFs will receive a corrected p value of 1, and I am filtering using a strict <. You could try specifying -fdr 2 (or anything else greater than 1)!

Best, Florian