Open jacau opened 6 years ago
Hi Jasmina, I guess the easiest way is just to pass the GB_TE files twice. I tried this and it didn't affect my output.
Since you didn't run the retrovirus protein sequence search, there is no file called "notKnown.fa.ervwb.gff" to use as input to the Java code. Can you please replace it with "notKnown.fa.tewb.gff" (the output of the GB_TE search)? Just as you did with GB_TE.fa, pass it twice.
Please feel free to contact us if you have any further questions!
Cheers :-), Lu
Hi Lu,
I did that and the errors are the same.
@jacau Hi Jacau, if possible, can you please share with me the files you used to run this Java code?
Hi Jacau,
The issue is caused by a difference in format between your GB_TE data and mine. NCBI keeps updating its library and changing its data format. In this case, instead of changing the Java code, I guess it's easier to just make your GB_TE data consistent with ours. The only thing you need to do is run the following commands:
perl -pi -e "s/^>/>gi|GBTE|sp|/g" GBTE.fa
sed -i 's/ /| /' GBTE.fa
Please feel free to contact me if you've any further questions. :-)
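For anyone who wants to see what the two commands above do to a single header, here is a minimal Java sketch of the same rewrite. It assumes the downloaded headers look like `>ACCESSION description`; the class and method names are made up for illustration and are not part of CARP. Unlike the sed command, it only touches header lines.

```java
// Hypothetical helper, not part of CARP: mirrors the perl and sed commands above.
final class GbteHeaderSketch {

    // Turns ">ACC description" into ">gi|GBTE|sp|ACC| description".
    static String rewriteHeader(String line) {
        if (!line.startsWith(">")) {
            return line; // sequence lines are left untouched
        }
        // Step 1 (perl): insert the fixed prefix right after '>'.
        String prefixed = ">gi|GBTE|sp|" + line.substring(1);
        // Step 2 (sed): replace the first space with "| ".
        int firstSpace = prefixed.indexOf(' ');
        if (firstSpace < 0) {
            return prefixed; // no description, so there is no space to replace
        }
        return prefixed.substring(0, firstSpace) + "| " + prefixed.substring(firstSpace + 1);
    }

    public static void main(String[] args) {
        System.out.println(rewriteHeader(
                ">AYJ71526.1 Taq DNA polymerase, partial [synthetic construct]"));
    }
}
```

The result should have the same layout as the rewritten headers quoted later in this thread, for example `>gi|GBTE|sp|AYJ71526.1| Taq DNA polymerase, partial [synthetic construct]`.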
I have renamed the fasta headers in my GBTE and retrovirus library files. Now I get a different error:
java GenerateAnnotatedLibrary
Exception in thread "main" java.lang.NullPointerException
at GenerateAnnotatedLibrary$RBClassifier.getClassifications(GenerateAnnotatedLibrary.java:338)
at GenerateAnnotatedLibrary$RBClassifier.<init>(GenerateAnnotatedLibrary.java:222)
at GenerateAnnotatedLibrary$RBClassifier.<init>(GenerateAnnotatedLibrary.java:220)
at GenerateAnnotatedLibrary.writeConsensusSequences(GenerateAnnotatedLibrary.java:625)
at GenerateAnnotatedLibrary.main(GenerateAnnotatedLibrary.java:468)
Here is my GenerateAnnotatedLibrary configuration:
private static String iDir = "./";
private static String oDir = "library/";
private static String library = oDir + "Denovo_TE_Library.fasta";
private static String headers = oDir + "wantedCSHeaders.txt";
private static String CSFile = iDir + "ConsensusSequences.fa";
private static String TEgff = iDir + "results_classify/notKnown.fa.tewb.gff";
private static String GBTE = "/nas-hs/db/gb_te/151018_GB_TE.fa";
private static String ERVgff = iDir + "results_classify/notKnown.fa.ervwb.gff";
private static String ALLR = "/nas-hs/db/retroviruses/sequence.rn.fasta";
private static String SSR = iDir + "SSR.txt";
private static String Proteins = "results_classify/protein.txt";
private static String IRS = iDir + "ConsensusSequences.fa.map";
private static String IRM = iDir + "results_classify/known.txt";
Here are some statistics and contents from the input files, maybe that will help tracking down the problem.
CSFile 171802 lines, 3801 sequences, head -3:
>family002795_consensus (2 members - 2 members within 0.95 of maximum length)
AATTCAATAATCAGCGATCTACAGGAACTGAAGTACTGCCGATTGGAACTGTCCACAAAA
TTATCAGGATTTGACAGGCAAAGGCAAATGGGTGATAAGTTTCACAAAGGAAAAGTTGCC
TEgff 292 lines, head -3:
family001882_consensus blast hit 10577 11170 1.47e-76 . . Target sp|CDS25417.2 1 206
family001802_consensus blast hit 2045 2584 6.02e-72 . . Target sp|CDS31061.1 11 190
family001984_consensus blast hit 1 561 1.29e-110 . . Target sp|CDS31628.2 309 495
GBTE 8891585 lines, 1064588 sequences, head -3:
>gi|GBTE|sp|AYJ71526.1| Taq DNA polymerase, partial [synthetic construct]
MEEMLPLFEPKGRVLLVDGHHLAYRTFHALKGLTTSRGEPVQAVYGFAKSLLKALKEDGDSVIVVFDAKA
PSFRHEAYEGYKARRAPTPEDFPRQLALIKELVDLLGLVRLEVPGYEADDVLASLAKKAEKEGYEVRILT
ERVgff 45 lines, head -3:
family002413_consensus blast hit 2351 2452 2.45e-06 . . Target sp|NC_027117.1 4650 4751
family002553_consensus blast hit 1018 1254 8.81e-13 . . Target sp|NC_001403.1 3263 3499
family003148_consensus blast hit 1570 2118 1.22e-36 . . Target sp|NC_039238.1 4384 4932
ALLR 10195 lines, 80 sequences, head -3:
>gi|GBTE|sp|NC_039242.1| Feline foamy virus DNA, complete genome
TGTCATGGGCCAAAGAGAATTCTCACAGAGGAGAATACTCTCTGCTGCCATCTAGTGACGATGAGGAAGA
AGAAATGTCAGAAAGAGAGGAATTATTGTGCCATATAAATCAGTGTCAACAAAAGCTCTTTTATCCCGGA
SSR 0 lines (empty file, size 0) ... I don't have any SSRs...
Proteins 663 lines, head -3:
Sequence MappedTo
family002496_consensus sp|sp|Q6GNY1|MIB1_XENLA
family002142_consensus sp|sp|Q7JQ07|MOS1T_DROMA
IRS 33 lines, head -3:
family000189_consensus 10767 10823 HAL1b 2199 2256 c 0 0 231
family000505_consensus 280 325 MARNA 335 381 c 0 0 231
family000597_consensus 457 764 HSMAR2 828 1137 d 0 0 431
IRM 1 line:
Sequence MappedTo
Java version openjdk 10.0.2 2018-07-17 OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2) OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.2, mixed mode)
Information on the organism: I am running CARP on Hymenolepis microstoma (GenBank assembly with 3643 sequences, 182136974 bp, GC content 35%). This is a tiny test genome. The goal is to run it on a reptile once I figure out how to run it ;-)
Any thoughts on how to fix the java error?
Best,
Katharina
Hi Katharina,
I may have found the issue that caused the error. It seems your "TEgff (notKnown.fa.tewb.gff)" input is separated by spaces rather than being tab-delimited. Can you please replace each run of spaces in this file with a single tab and rerun the code?
sed -i -E 's/ +/\t/g' notKnown.fa.tewb.gff
Cheers, Lu
Dear Lu, that is not the source of the problem. If it looks like there are spaces rather than tabs, that is an artefact of copying the file contents into GitHub. The file already contains tabs. (I nevertheless ran the suggested sed command, but it had no effect since there were no spaces, so the problem remains.) I also checked notKnown.fa.spwb.gff, notKnown.fa.ervwb.gff and notKnown.fa.gff; the tabs are there. What else might be the problem? Feel free to contact me via e-mail (katharina.hoff@uni-greifswald.de), I can make all files available to you via ftp. Best, Katharina
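Since pasting into GitHub hides the difference between tabs and spaces, the delimiter question can be settled on the file itself. Below is a minimal Java sketch that counts the lines containing no tab character at all. The class name is hypothetical, the file path is taken from the first command-line argument, and skipping empty lines and `#` comment lines is an assumption about what the file may contain.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical check, not part of CARP: reports lines that cannot be tab-delimited.
final class TabCheckSketch {

    // Counts non-empty, non-comment lines that contain no tab character.
    static long linesWithoutTab(List<String> lines) {
        return lines.stream()
                .filter(l -> !l.isEmpty() && !l.startsWith("#"))
                .filter(l -> l.indexOf('\t') < 0)
                .count();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the file to check, e.g. notKnown.fa.tewb.gff
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println(linesWithoutTab(lines) + " of " + lines.size()
                + " lines contain no tab");
    }
}
```

A count of zero means every data line has at least one tab, which rules out the purely space-separated case.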
Lu helped me figure out the problem. I had misunderstood the comment in the head of GenerateAnnotatedLibrary.java:
/home/a1635743/RepBase20.04.fasta/*rep.ref <- I thought that this refers to all the *rep.ref files in /home/a1635743/RepBase20.04.fasta/; since GenerateAnnotatedLibrary.java does not resolve the star notation, I had merged all these files and provided the merged file at line 438:
private String getLibraryDirectory () {
return "repbase.fa";
}
Instead, this should be a directory (without the star notated contents):
private String getLibraryDirectory () {
return "/nas-hs/db/repbase/repbase/RepBase23.09.fasta/";
}
Thank you very much for your help, Lu!
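A quick way to catch this misconfiguration before running the final step is to check that the configured path is a directory and that it contains files matching `*rep.ref`. The sketch below is only an illustration under that assumption: the class and method names are hypothetical, and it does not claim to reproduce how GenerateAnnotatedLibrary.java actually reads the directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check, not part of CARP.
final class LibraryDirCheckSketch {

    // Returns the *rep.ref files in libraryDir, or fails if it is not a directory.
    static List<Path> findRefFiles(Path libraryDir) throws IOException {
        if (!Files.isDirectory(libraryDir)) {
            throw new IllegalArgumentException(libraryDir + " is not a directory");
        }
        List<Path> refs = new ArrayList<>();
        // The glob is taken from the comment at the top of GenerateAnnotatedLibrary.java.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(libraryDir, "*rep.ref")) {
            for (Path p : stream) {
                refs.add(p);
            }
        }
        return refs;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the value you intend to return from getLibraryDirectory()
        Path dir = Paths.get(args[0]);
        System.out.println(findRefFiles(dir).size() + " *rep.ref files found in " + dir);
    }
}
```

Passing a merged fasta file such as repbase.fa makes the check fail immediately with a readable message.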
Hi Katharina, I'm glad we have figured this issue out together :-). Thank you so much for helping me polish the carp document!
Please feel free to contact me if you have any further questions.
Many thanks, Lu
Hi, I seem to be having the same problem.
I have modified lines 437-439 to read:
private String getLibraryDirectory () { return "/Users/jblommaert/Desktop/CARP_Annotation/RepeatMaskerLib.fa"; }
which is where my Repbase library is
I get the error
Exception in thread "main" java.lang.NullPointerException
at GenerateAnnotatedLibrary$RBClassifier.getClassifications(GenerateAnnotatedLibrary.java:338)
at GenerateAnnotatedLibrary$RBClassifier.<init>(GenerateAnnotatedLibrary.java:222)
at GenerateAnnotatedLibrary.writeConsensusSequences(GenerateAnnotatedLibrary.java:625)
at GenerateAnnotatedLibrary.main(GenerateAnnotatedLibrary.java:468)
I don't think it's the GBTE.fa file, I modified this with the code above
My files and details:
private static String iDir = "./";
private static String oDir = "library/";
private static String library = oDir + "Denovo_TE_Library.fasta";
private static String headers = oDir + "wantedCSHeaders.txt";
//private static String satFile = sDir + "LA4v2-satellite.fa";//File moved to all-repeats
private static String CSFile = iDir + "ConsensusSequences.fa";
private static String TEgff = iDir + "notKnown.fa.tewb.gff";
private static String GBTE = iDir + "GBTE.fa";
private static String ERVgff = iDir + "notKnown.fa.tewb.gff";
private static String ALLR = iDir + "GBTE.fa";
private static String SSR = iDir + "SSR.txt";
private static String Proteins = iDir + "protein.txt";
private static String IRS = iDir + "ConsensusSequences.fa.map";
private static String IRM = iDir + "known.txt";
private static double restMinCoverage = .9;
private static double sineMinCoverage = .9;
private static boolean debug = false;
ConsensusSequences.fa
8744 lines, 420 sequences, example header:
>family000321_consensus (2 members - 2 members within 0.95 of maximum length)
notKnown.fa.tewb.gff
2 lines (definitely tab delimited), first one:
family000217_consensus blast hit 329 532 8.4e-25 + . Target sp|KRY07909.1 517 584; QueryLength 541; TargetLength 661; Annot "Retrovirus-related Pol polyprote..."
GBTE.fa
9832822 lines, 1212286 sequences, example header:
>gi|GBTE|sp|GBL48032.1| DNA-directed RNA polymerase subunit A' [[Candida] auris]
I repeated these two files (notKnown.fa.tewb.gff and GBTE.fa) because I had no ERVs, and this gave me a different error
SSR.txt is empty
protein.txt 7 lines, example:
Sequence MappedTo
family000337_consensus sp|P24499
known.txt 65 lines, example:
Sequence MappedTo
family000044_consensus LINE_comp_TRINITY_DN5051_c6_g1_i2#LINE
family000055_consensus LINE_comp_TRINITY_DN5051_c6_g1_i2#LINE
The program produces the Denovo_TE_Library.fasta file, but there is nothing inside
Hi rotifergirl ,
Is RepeatMaskerLib.fa a directory or a fasta file? It is supposed to be a directory that stores the Repbase library, as described at the top of the code: "/home/a1635743/RepBase20.04.fasta/*rep.ref (RepBase libraries to base classification on).". Can you please try that first?
Also, it's totally normal if SSR.txt is empty, meaning censor didn't find any hits to the SSR library.
Please let me know if you still can't figure this problem out! :-)
Kind regards, Lu
Hi Lu,
Since repbase is no longer free, I only have the library as a .fasta, so I'm not sure how to get around this?
Julie
Hi Julia,
That's very sad news. To be honest, I have no idea how to get around it either. But you can find an old version of the RepBase database at our Mendeley link: https://data.mendeley.com/datasets/k88h5xnhcb/1.
Otherwise I assume the best way to get the most recent RepBase data is to get a license from them. :-(
Lu
That seems to have been the problem! Thanks for the help!
You are welcome! ;-)
Hi, I used carp-te on a plant genome, and everything went well until the last step of generating the library. The error I'm getting is:
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1967)
at GenerateAnnotatedLibrary.getAllRetroAnnotations(GenerateAnnotatedLibrary.java:531)
at GenerateAnnotatedLibrary.writeConsensusSequences(GenerateAnnotatedLibrary.java:623)
at GenerateAnnotatedLibrary.main(GenerateAnnotatedLibrary.java:468)
It seems to be an issue with the GB_TE.fa file, which I downloaded using the efetch perl script provided in the code. Also, I didn't run the retrovirus protein sequence search, so I tried just passing the GB_TE files twice (is there a better way to handle this?).
Thanks, Jasmina
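One way to test the suspicion about GB_TE.fa is to scan its headers before running the final step. The sketch below counts header lines that do not follow the `>gi|GBTE|sp|ACCESSION| description` layout seen in the header examples quoted in this thread. The class name and the regular expression are assumptions made for illustration; the exact parsing at GenerateAnnotatedLibrary.java line 531 is not shown in this thread, so a clean result does not guarantee the Java step will succeed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical header check, not part of CARP.
final class HeaderFormatCheckSketch {

    // Assumed layout: ">gi|GBTE|sp|<accession>| <description>"
    private static final Pattern EXPECTED =
            Pattern.compile(">gi\\|GBTE\\|sp\\|[^|\\s]+\\| .*");

    static boolean looksConsistent(String headerLine) {
        return EXPECTED.matcher(headerLine).matches();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the fasta file to check, e.g. GB_TE.fa; lines are streamed
        // because the file can contain millions of lines.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]))) {
            long bad = lines.filter(l -> l.startsWith(">"))
                    .filter(l -> !looksConsistent(l))
                    .count();
            System.out.println(bad + " headers do not match the expected layout");
        }
    }
}
```

If the count is non-zero, the headers probably still need to be rewritten into the expected layout.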