carp-te / carp-documentation


error with GenerateAnnotatedLibrary.java #14

Open jacau opened 6 years ago

jacau commented 6 years ago

Hi, I used carp-te on a plant genome, and everything went well until the last step of generating the library. The error I'm getting is:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at GenerateAnnotatedLibrary.getAllRetroAnnotations(GenerateAnnotatedLibrary.java:531)
    at GenerateAnnotatedLibrary.writeConsensusSequences(GenerateAnnotatedLibrary.java:623)
    at GenerateAnnotatedLibrary.main(GenerateAnnotatedLibrary.java:468)

It seems to be an issue with the GB_TE.fa file, which I downloaded using the efetch Perl script provided in the code. Also, I didn't run the retrovirus protein sequence search, so I tried just passing the GB_TE files twice (is there a better way to handle this?).

Thanks, Jasmina

luzengAdelaide commented 6 years ago

Hi Jasmina, I guess the easiest way is just to pass the GB_TE files twice. I tried this and it didn't affect my output.

Since you didn't run the retrovirus protein sequence search, there is no file called "notKnown.fa.ervwb.gff" to pass to the Java code. Can you please replace it with "notKnown.fa.tewb.gff" (the output of the GB_TE search)? That is, pass it twice, as you did with GB_TE.fa.

Please feel free to contact us if you have any further questions!

Cheers :-), Lu

jacau commented 6 years ago

Hi Lu,

I did that and the errors are the same.

luzengAdelaide commented 6 years ago

@jacau Hi Jacau, if possible, can you please share with me the files you used to run this Java code?

luzengAdelaide commented 6 years ago

Hi Jacau,

The issue is caused by a difference in GB_TE data format between your file and mine. NCBI keeps updating its library and changing its data format. In this case, instead of changing the Java code, I guess it's easier to make your GB_TE data consistent with ours. All you need to do is run the following commands:

perl -pi -e "s/^>/>gi|GBTE|sp|/g" GBTE.fa
sed -i 's/ /| /' GBTE.fa

Please feel free to contact me if you've any further questions. :-)
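
For readers hitting the same exception, here is a minimal Java sketch of what the two commands above do to a header line, and of one common way a missing "|" ends in a StringIndexOutOfBoundsException. The class GbteHeaderSketch and both methods are hypothetical, and the parsing in accession() is an assumption for illustration; it is not copied from line 531 of GenerateAnnotatedLibrary.java.

```java
// Hypothetical helper for illustration; it is not part of CARP.
class GbteHeaderSketch {
    // Prefix added by the perl command above.
    static final String PREFIX = ">gi|GBTE|sp|";

    // Mirrors the perl + sed commands for a single header line. Unlike the
    // commands, it also skips headers that already carry the prefix.
    static String normalize(String header) {
        if (!header.startsWith(">") || header.startsWith(PREFIX)) {
            return header; // sequence line, or header already converted
        }
        String rest = header.substring(1);
        int space = rest.indexOf(' ');
        if (space < 0) {
            return PREFIX + rest; // no description, so sed has nothing to change
        }
        // Insert '|' before the first space, as sed 's/ /| /' does.
        return PREFIX + rest.substring(0, space) + "|" + rest.substring(space);
    }

    // ASSUMED parsing style, not taken from GenerateAnnotatedLibrary.java:
    // read the accession between the prefix and the next '|'.
    static String accession(String header) {
        int start = PREFIX.length();
        int end = header.indexOf('|', start); // -1 when the '|' is missing
        return header.substring(start, end);  // throws when end is -1
    }

    public static void main(String[] args) {
        String raw = ">AYJ71526.1 Taq DNA polymerase, partial [synthetic construct]";
        String fixed = normalize(raw);
        System.out.println(fixed);
        System.out.println(accession(fixed));
        try {
            accession(raw); // header in the newer NCBI layout, not converted
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unconverted header fails: " + e.getMessage());
        }
    }
}
```

If the real parser works along these lines, any header without the expected "|" delimiters would fail in the same way, which would be consistent with the trace in the first post.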

KatharinaHoff commented 6 years ago

I have renamed the FASTA headers in my GBTE and retrovirus library files. Now I get a different error:

java GenerateAnnotatedLibrary
Exception in thread "main" java.lang.NullPointerException
    at GenerateAnnotatedLibrary$RBClassifier.getClassifications(GenerateAnnotatedLibrary.java:338)
    at GenerateAnnotatedLibrary$RBClassifier.<init>(GenerateAnnotatedLibrary.java:222)
    at GenerateAnnotatedLibrary$RBClassifier.<init>(GenerateAnnotatedLibrary.java:220)
    at GenerateAnnotatedLibrary.writeConsensusSequences(GenerateAnnotatedLibrary.java:625)
    at GenerateAnnotatedLibrary.main(GenerateAnnotatedLibrary.java:468)

Here is my GenerateAnnotatedLibrary configuration:

        private static String iDir = "./";
        private static String oDir = "library/";
        private static String library = oDir + "Denovo_TE_Library.fasta";
        private static String headers = oDir + "wantedCSHeaders.txt";
        private static String CSFile = iDir + "ConsensusSequences.fa";
        private static String TEgff = iDir + "results_classify/notKnown.fa.tewb.gff";
        private static String GBTE = "/nas-hs/db/gb_te/151018_GB_TE.fa";
        private static String ERVgff = iDir + "results_classify/notKnown.fa.ervwb.gff";
        private static String ALLR = "/nas-hs/db/retroviruses/sequence.rn.fasta";
        private static String SSR = iDir + "SSR.txt";
        private static String Proteins = "results_classify/protein.txt";
        private static String IRS = iDir + "ConsensusSequences.fa.map"; 
        private static String IRM = iDir + "results_classify/known.txt";

Here are some statistics and contents from the input files, maybe that will help tracking down the problem.

CSFile 171802 lines, 3801 sequences, head -3:

>family002795_consensus (2 members - 2 members within 0.95 of maximum length)
AATTCAATAATCAGCGATCTACAGGAACTGAAGTACTGCCGATTGGAACTGTCCACAAAA
TTATCAGGATTTGACAGGCAAAGGCAAATGGGTGATAAGTTTCACAAAGGAAAAGTTGCC

TEgff 292 lines, head -3:

family001882_consensus  blast   hit     10577   11170   1.47e-76        .       .       Target sp|CDS25417.2 1 206
family001802_consensus  blast   hit     2045    2584    6.02e-72        .       .       Target sp|CDS31061.1 11 190
family001984_consensus  blast   hit     1       561     1.29e-110       .       .       Target sp|CDS31628.2 309 495

GBTE 8891585 lines, 1064588 sequences, head -3:

>gi|GBTE|sp|AYJ71526.1| Taq DNA polymerase, partial [synthetic construct]
MEEMLPLFEPKGRVLLVDGHHLAYRTFHALKGLTTSRGEPVQAVYGFAKSLLKALKEDGDSVIVVFDAKA
PSFRHEAYEGYKARRAPTPEDFPRQLALIKELVDLLGLVRLEVPGYEADDVLASLAKKAEKEGYEVRILT

ERVgff 45 lines, head -3:

family002413_consensus  blast   hit 2351    2452    2.45e-06    .   .   Target sp|NC_027117.1 4650 4751
family002553_consensus  blast   hit 1018    1254    8.81e-13    .   .   Target sp|NC_001403.1 3263 3499
family003148_consensus  blast   hit 1570    2118    1.22e-36    .   .   Target sp|NC_039238.1 4384 4932

ALLR 10195 lines, 80 sequences, head -3:

>gi|GBTE|sp|NC_039242.1| Feline foamy virus DNA, complete genome
TGTCATGGGCCAAAGAGAATTCTCACAGAGGAGAATACTCTCTGCTGCCATCTAGTGACGATGAGGAAGA
AGAAATGTCAGAAAGAGAGGAATTATTGTGCCATATAAATCAGTGTCAACAAAAGCTCTTTTATCCCGGA

SSR 0 lines (empty file, size 0) ... I don't have any SSRs...

Proteins 663 lines, head -3:

Sequence MappedTo
family002496_consensus sp|sp|Q6GNY1|MIB1_XENLA
family002142_consensus sp|sp|Q7JQ07|MOS1T_DROMA

IRS 33 lines, head -3:

family000189_consensus  10767   10823   HAL1b   2199    2256    c   0   0   231
family000505_consensus  280 325 MARNA   335 381 c   0   0   231
family000597_consensus  457 764 HSMAR2  828 1137    d   0   0   431

IRM 1 line: Sequence MappedTo

Java version openjdk 10.0.2 2018-07-17 OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2) OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2, mixed mode)

Information on organism: I am running CARP on Hymenolepis microstoma (GenBank assembly with 3643 sequences, 182136974 bp, GC content 35%). This is a tiny test genome. The goal is to run it on a reptile once I figure out how to run it ;-)

Any thoughts on how to fix the java error?

Best,

Katharina

luzengAdelaide commented 6 years ago

Hi Katharina,

I may have found the issue that caused the error. It seems that your TEgff input (notKnown.fa.tewb.gff) is separated by spaces rather than tabs. Can you please replace the spaces with a single tab in this file and rerun the code?

The command below will help you with the replacement.

sed -i 's/ + /\t/g' notKnown.fa.tewb.gff

Cheers, Lu

KatharinaHoff commented 6 years ago

Dear Lu,

that is not the source of the problem. If it looks like there are spaces rather than tabs, that is an artefact of copying the file contents into GitHub. The file already contains tabs. (I nevertheless ran the suggested sed command, but it had no effect since there were no spaces, so the problem remains.) I also checked notKnown.fa.spwb.gff, notKnown.fa.ervwb.gff, and notKnown.fa.gff; the tabs are there. What else might be the problem? Feel free to contact me via e-mail (katharina.hoff@uni-greifswald.de); I can make all files available to you via FTP.

Best, Katharina
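
Since pasted text does not show whether a file uses tabs or spaces, one way to check is to count the tab-separated fields on each line. The sketch below is a hypothetical helper, not part of CARP. The expected count of nine fields is an assumption based on the sample GFF lines in this thread, where the last field ("Target ...") itself contains spaces.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for illustration; it is not part of CARP.
class GffTabCheck {
    // ASSUMPTION: nine tab-separated fields, as in the sample lines above.
    static final int EXPECTED_COLUMNS = 9;

    // Number of tab-separated fields; limit -1 keeps trailing empty fields.
    static int tabColumns(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) throws IOException {
        Path gff = Paths.get(args.length > 0 ? args[0] : "notKnown.fa.tewb.gff");
        if (!Files.isReadable(gff)) {
            System.err.println("Cannot read " + gff);
            return;
        }
        List<String> lines = Files.readAllLines(gff, StandardCharsets.UTF_8);
        int suspicious = 0;
        for (int i = 0; i < lines.size(); i++) {
            int cols = tabColumns(lines.get(i));
            if (cols != EXPECTED_COLUMNS) {
                suspicious++;
                System.out.printf("line %d has %d tab-separated fields%n", i + 1, cols);
            }
        }
        System.out.printf("%d of %d lines differ from the expected layout%n",
                suspicious, lines.size());
    }
}
```

A line that was separated by spaces only would be reported with a single field, so it stands out immediately.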

KatharinaHoff commented 6 years ago

Lu helped me figure out the problem. I had misunderstood the comment in the head of GenerateAnnotatedLibrary.java:

/home/a1635743/RepBase20.04.fasta/*rep.ref <- I thought that refers to all the *rep.ref files in /home/a1635743/RepBase20.04.fasta/; since GenerateAnnotatedLibrary.java does not resolve the star notation, I had merged all these files and provided that merged file at line 438:

private String getLibraryDirectory () {
   return "repbase.fa";
}

Instead, this should be a directory (without the star notated contents):

private String getLibraryDirectory () {
   return "/nas-hs/db/repbase/repbase/RepBase23.09.fasta/";
}

Thank you very much for your help, Lu!
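
A quick way to check this setting before running the full program is to test whether the configured path is a directory and to list the files it contains. The sketch below is a hypothetical helper, not part of CARP. It relies on the documented behaviour that java.io.File.list() returns null when the path is not a directory. Whether that null is what triggers the NullPointerException at line 338 is an assumption; the CARP source was not inspected here.

```java
import java.io.File;

// Hypothetical helper for illustration; it is not part of CARP.
class RepbaseDirCheck {
    // Names of files in dir that end with the given suffix.
    // Returns an empty array when dir is not a directory, because
    // File.list() returns null in that case.
    static String[] libraryFiles(File dir, String suffix) {
        String[] names = dir.list((parent, name) -> name.endsWith(suffix));
        return names == null ? new String[0] : names;
    }

    public static void main(String[] args) {
        // Pass the value returned by getLibraryDirectory() as the argument.
        File dir = new File(args.length > 0 ? args[0] : ".");
        if (!dir.isDirectory()) {
            System.err.println(dir + " is not a directory (a merged FASTA file will not work)");
            return;
        }
        // ASSUMPTION: library files are named *rep.ref, as in the header comment.
        String[] found = libraryFiles(dir, "rep.ref");
        System.out.println(found.length + " *rep.ref file(s) found in " + dir);
        for (String name : found) {
            System.out.println("  " + name);
        }
    }
}
```

Running it against a merged file such as "repbase.fa" prints the "not a directory" message, which matches the misconfiguration described above.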

luzengAdelaide commented 6 years ago

Hi Katharina, I'm glad we have figured this issue out together :-). Thank you so much for helping me polish the carp document!

Please feel free to contact me if you have any further questions.

Many thanks, Lu

rotifergirl commented 5 years ago

Hi, I seem to be having the same problem.

I have modified lines 437-439 to read:

private String getLibraryDirectory () {
   return "/Users/jblommaert/Desktop/CARP_Annotation/RepeatMaskerLib.fa";
}

which is where my Repbase library is

I get the error

Exception in thread "main" java.lang.NullPointerException
    at GenerateAnnotatedLibrary$RBClassifier.getClassifications(GenerateAnnotatedLibrary.java:338)
    at GenerateAnnotatedLibrary$RBClassifier.<init>(GenerateAnnotatedLibrary.java:222)
    at GenerateAnnotatedLibrary.writeConsensusSequences(GenerateAnnotatedLibrary.java:625)
    at GenerateAnnotatedLibrary.main(GenerateAnnotatedLibrary.java:468)

I don't think it's the GBTE.fa file, since I modified it with the commands above.

My files and details:

        private static String iDir = "./";
        private static String oDir = "library/";
        private static String library = oDir + "Denovo_TE_Library.fasta";
        private static String headers = oDir + "wantedCSHeaders.txt";
        //private static String satFile = sDir + "LA4v2-satellite.fa";//File moved to all-repeats
        private static String CSFile = iDir + "ConsensusSequences.fa";
        private static String TEgff = iDir + "notKnown.fa.tewb.gff";
        private static String GBTE = iDir + "GBTE.fa";
        private static String ERVgff = iDir + "notKnown.fa.tewb.gff";
        private static String ALLR = iDir + "GBTE.fa";
        private static String SSR = iDir + "SSR.txt";
        private static String Proteins = iDir + "protein.txt";
        private static String IRS = iDir + "ConsensusSequences.fa.map";
        private static String IRM = iDir + "known.txt";
        private static double restMinCoverage = .9;
        private static double sineMinCoverage = .9;
        private static boolean debug = false;

ConsensusSequences.fa 8744 lines, 420 sequences, example header:

>family000321_consensus (2 members - 2 members within 0.95 of maximum length)

notKnown.fa.tewb.gff 2 lines (definitely tab delimited), first one:

family000217_consensus blast hit 329 532 8.4e-25 + . Target sp|KRY07909.1 517 584; QueryLength 541; TargetLength 661; Annot "Retrovirus-related Pol polyprote..."

GBTE.fa 9832822 lines, 1212286 sequences, example header: >gi|GBTE|sp|GBL48032.1| DNA-directed RNA polymerase subunit A' [[Candida] auris]

I repeated these two files (notKnown.fa.tewb.gff and GBTE.fa) because I had no ERVs, and this gave me a different error.

SSR.txt is empty

protein.txt 7 lines, example:

Sequence MappedTo
family000337_consensus sp|P24499

known.txt 65 lines, example:

Sequence MappedTo
family000044_consensus LINE_comp_TRINITY_DN5051_c6_g1_i2#LINE
family000055_consensus LINE_comp_TRINITY_DN5051_c6_g1_i2#LINE

The program produces the Denovo_TE_Library.fasta file, but there is nothing inside

luzengAdelaide commented 5 years ago

Hi rotifergirl,

Is RepeatMaskerLib.fa a directory or a FASTA file? It is supposed to be a directory that stores the Repbase library, as described at the top of the code: "/home/a1635743/RepBase20.04.fasta/*rep.ref (RepBase libraries to base classification on).". Can you please try that first?

Also, it's totally normal if SSR.txt is empty, meaning censor didn't find any hits to the SSR library.

Please let me know if you still can't figure this problem out! :-)

Kind regards, Lu

rotifergirl commented 5 years ago

Hi Lu,

Since repbase is no longer free, I only have the library as a .fasta, so I'm not sure how to get around this?

Julie

luzengAdelaide commented 5 years ago

Hi Julie,

That's very sad news. To be honest, I have no idea how to get around it either. But you can find an old version of the RepBase database via our Mendeley link: https://data.mendeley.com/datasets/k88h5xnhcb/1.

Otherwise I assume the best way to get the most recent RepBase data is to get a license from them. :-(

Lu

rotifergirl commented 5 years ago

That seems to have been the problem! Thanks for the help!

luzengAdelaide commented 5 years ago

You are welcome! ;-)