Closed gnmcsbnfrmtcsclb closed 4 years ago
Hi @gnmcsbnfrmtcsclb
Thank you very much for contacting me and please accept my apologies for the late reply.
I am very happy to learn that you could install LTRpred
without problems and that it is useful for your research and analyses.
You might also find the newly created LTRpred
Docker container useful for running LTRpred in a server or cluster environment.
Please find a detailed response to your questions below:
Can your LTRpred tool report all these 4 cases listed above, or only LTRs of type 1 and type 3 ?
Type 1. Full-length LTRs Type 2. (Internally) Truncated LTRs (both ends intact) Type 3. Solo LTRs (missing one of the ends) Type 4. Orphan LTRs (missing both ends, but with reecognizable internal features)
LTRpred
can only report your specified Type 1 to Type 3 cases.
For deriving Type 4 like elements, the best way would be to run a stringent BLASTN search
using the DNA sequences of the typical gag
, pol
, etc loci found in many retrotransposons (between their 5' and 3' LTRs) against the reference genome that you wish to annotate. If you saw other ways of annotating these elements, I could have a look at these approaches and assess whether it would make sense to integrate this annotation feature into LTRpred.
Regarding the missing files from the documentation, this is no problem at all. I will go over the documentation to make sure that it is up to date.
After looking at your *.txt
file, I see that the LTRpred
run performed corretly.
However, tsv file created. Can this tsv output be used interchangably for downstream processing, instead of the expected csv? Or do I need to use your pred2csv function? F
Yes, this is just an old documentation issue. I changed the format from csv
to tsv
. It is exactly the same output. There is no need to use the pred2csv
function.
For generating GFF file containing 4 different types of LTRs based on their degree of sequence completeness (see top of this post), has LTRpred already generated files that I can parse to glean this information using sed, awk, Perl etc. We are total newbies at R.
No it has not.
If not, I'd like to first start with generating solo-LTR annotation. My understanding is that for this, I would need to use
ltr.cn
, am I right? If not, what functions do I use? Where can I find this info?
Absolutely correct. Alternatively, you can specify copy.number.est = TRUE
in the LTRpred::LTRpred(...)
run which will run ltr.cn()
internally and will store the output files in the tempdir()
directory.
I hope this helps?
Best wishes, Hajk
Thanks a lot for your email and response here. We have a few follow-up questions / requests:
Is there anywhere, may be library(LTRpred) vignettes (we are new to R) where for some small size sample input file, you show all the details - run syntax, runtime messages on-screen, and all the new files and folders created by the software, including filenames, file extensions so it's easy to compare our runs with a standard run to make sure all is OK?
I think we understand your strategy to identify 'orphan LTRs'. THank you! But for the other 3 categories, i.e. full-length, internally-truncated, and Solo-LTRs, does LTRpred use specific terms/terminology to categorize them in the results, say in the GFF output file? If yes, what are those terminologies we can grep search for? If not, how can we post-process the LTRpred results, say the GFF file, to categorize them as we've indicated?
If we already completed the LTRpred run once successfully, as you've confirmed, then I suppose we can choose to only run ltr.cn()
rather than copy.number.est = TRUE
for LTRpred::LTRpred(...)
, correct? And this can be either inside or outside Docker, correct? Are there pros and cons to running LTRpred inside vs. outside of Docker?
Did you have a chance to look at the documentation? https://hajkd.github.io/LTRpred/articles/Introduction.html#quick-start
I would suggest to import the annotation data sheet returned by LTRpred
as stated here: https://hajkd.github.io/LTRpred/articles/Introduction.html#import-ltrpred-output
This will give you an overview what to grep
for.
No there are no pros or cons. It is the exact same tool. The Docker only helps people who are not able to install all tool dependencies.
Quick follow-up, so for reporting solo 3' and 5' LTRs, our understanding is that you in your previous post in this thread, you'd suggested we follow these steps:
If output results from our LTRpred run are found in filename - Run.tsv
, then execute these steps:
library(LTRtype)
Run_Result <- read.ltrpred("Run.tsv")
dplyr::select(Run, ltr_similarity:end, tRNA_motif, Clust_cn) # just a check
Run_Result$cn_3ltr # should give vector of count numbers for that 3' LTR sequence that exists as solo element, across the target genome, right?
Run_Result$cn_5ltr # should give vector of count numbers for that 5' LTR sequence that exists as solo element, across the target genome, right?
Could you please confirm our steps and understanding is correct? Thank you!
In this Run.tsv
results file from a successful LTRpred run as input, there are 1,994 rows (and 91 columns)
Howver, Run_Result$cn_5ltr
and Run_Result$cn_3ltr
both give 1,994 long array of NAs - this means none of the 5' or 3' LTR sequences reported by LTRtype is found as a solo element, which would be a minimum copy number of 1, correct?
Is the complete lack of solo LTRs even possible, because this is an almost 400MB plant genome, and so is expected to contain a lot of repeat elements, including LTRs, at least a few f which would be solo LTRs, right?
So I wonder if I ran LTRpred wrong (no STDOUT error though), or with wrong run or filter parameter(s) etc.?
Since an important goals is reporting solo LTR coords, how do you advice us to go about it, given our update in this post? Thanks, in advance.
Hi,
Could you please specify the LTRpred::LTRpred(...)
call that you were using? Otherwise it is hard for me to assess whether you used the correct parameter configuration.
In the LTRpred::LTRpred(...)
function specification (https://hajkd.github.io/LTRpred/reference/LTRpred.html),
did you set the argument copy.number.est = TRUE
? If not (default is copy.number.est = FALSE
) then no copy number estimation (including solo LTR prediction) will be performed and NAs
are placed instead.
Is the complete lack of solo LTRs even possible, because this is an almost 400MB plant genome, and so is expected to contain a lot of repeat elements, including LTRs, at least a few f which would be solo LTRs, right?
Correct, there should definitely be solo LTRs.
I hope this helps.
I seek help in a few different areas, by way of your response to this post. Some context first - Here are my LTR discovery goals. For my genome of interest, I want to report the following: Type 1. Full-length LTRs Type 2. (Internally) Truncated LTRs (both ends intact) Type 3. Solo LTRs (missing one of the ends) Type 4. Orphan LTRs (missing both ends, but with reecognizable internal features) Can your LTRpred tool report all these 4 cases listed above, or only LTRs of type 1 and type 3 ?
In any case, installation of a few missing dependencies went smoothly. Most of them, I already had installed on my laptop. I ran LTRpred on a full-length genome, as a test run, with quick run syntax shown below.
The run's STDOUT is shown in the attached file: LTRpred_TEST_STDOUT_UPDATED.txt
The output folder contents are shown in another attached file:
LTRpred_FolderListing_Expanded.txt
Here are my observations about the output files / folders - could you please add your thoughts to these?
MISSING - Is this a problem? But derivative files .fsa are present, as shown below
MISSING - Is this a problem? However, other related file are generated, see below:
MISSING - Is this a problem?
MISSING - Is this a problem?
MISSING - Is this a problem?
However, tsv file created. Can this tsv output be used interchangably for downstream processing, instead of the expected csv? Or do I need to use your
pred2csv
function?For generating GFF file containing 4 different types of LTRs based on their degree of sequence completeness (see top of this post), has LTRpred already generated files that I can parse to glean this information using sed, awk, Perl etc. We are total newbies at R.
If not, I'd like to first start with generating solo-LTR annotation. My understanding is that for this, I would need to use ltr.cn, am I right? If not, what functions do I use? Where can I find this info?
Thanks, in advance. Charlotte