HajkD / LTRpred

De novo annotation of young retrotransposons
https://hajkd.github.io/LTRpred/
GNU General Public License v2.0
46 stars 8 forks source link

Seeking advice for my test case, test run output validation, and post-processing of files #11

Closed gnmcsbnfrmtcsclb closed 4 years ago

gnmcsbnfrmtcsclb commented 5 years ago

I seek help in a few different areas, by way of your response to this post. Some context first - Here are my LTR discovery goals. For my genome of interest, I want to report the following: Type 1. Full-length LTRs Type 2. (Internally) Truncated LTRs (both ends intact) Type 3. Solo LTRs (missing one of the ends) Type 4. Orphan LTRs (missing both ends, but with reecognizable internal features) Can your LTRpred tool report all these 4 cases listed above, or only LTRs of type 1 and type 3 ?

In any case, installation of a few missing dependencies went smoothly. Most of them, I already had installed on my laptop. I ran LTRpred on a full-length genome, as a test run, with quick run syntax shown below.

library(LTRpred) LTRpred(genome.file = "TEST.fasta", cluster = TRUE, cores = 4)

The run's STDOUT is shown in the attached file: LTRpred_TEST_STDOUT_UPDATED.txt

The output folder contents are shown in another attached file:
LTRpred_FolderListing_Expanded.txt

Here are my observations about the output files / folders - could you please add your thoughts to these?

_ltrdigest/_index_ltrdigest.fsa : The suffixarray index file used to predict putative LTR retrotransposonswith LTRdigest.

MISSING - Is this a problem? But derivative files .fsa are present, as shown below

$ ls *_index_ltrdigest.fsa*
TEST_index_ltrdigest.fsa.des
TEST_index_ltrdigest.fsa.md5
TEST_index_ltrdigest.fsa.sds
TEST_index_ltrdigest.fsa.esq
TEST_index_ltrdigest.fsa.prj
TEST_index_ltrdigest.fsa.ssp

_ltrdigest/-ltrdigestpdom_ali.fas : Stores the alignment information for all matches of the given protein domain model to the translations of all candidates.

MISSING - Is this a problem? However, other related file are generated, see below:

ls *-ltrdigest_pdom_*.fas
TEST-ltrdigest_pdom_RNase_H.fas           TEST-ltrdigest_pdom_RVT_2.fas             TEST-ltrdigest_pdom_rve.fas
TEST-ltrdigest_pdom_RNase_H_aa.fas        TEST-ltrdigest_pdom_RVT_2_aa.fas          TEST-ltrdigest_pdom_rve_aa.fas
TEST-ltrdigest_pdom_RVT_1.fas             TEST-ltrdigest_pdom_Retrotrans_gag.fas
TEST-ltrdigest_pdom_RVT_1_aa.fas          TEST-ltrdigest_pdom_Retrotrans_gag_aa.fas

*_orfs_nt.fsa : Stores the predicted open reading frames within the predicted LTR transposons as DNA sequence.

MISSING - Is this a problem?

*_orfs_aa.fsa : Stores the predicted open reading frames within the predicted LTR transposons as protein sequence.

MISSING - Is this a problem?

*_LTRpred_DataSheet.csv : Stores the output table as data sheet.

MISSING - Is this a problem?

However, tsv file created. Can this tsv output be used interchangably for downstream processing, instead of the expected csv? Or do I need to use your pred2csv function?

For generating GFF file containing 4 different types of LTRs based on their degree of sequence completeness (see top of this post), has LTRpred already generated files that I can parse to glean this information using sed, awk, Perl etc. We are total newbies at R.

If not, I'd like to first start with generating solo-LTR annotation. My understanding is that for this, I would need to use ltr.cn, am I right? If not, what functions do I use? Where can I find this info?

     ltr.cn(data.sheet, LTR.fasta_3ltr, LTR.fasta_5ltr, genome,
       ltr.similarity = 70, scope.cutoff = 0.85, perc.ident.cutoff = 70,
       output = NULL, max.hits = 65000, eval = 1e-10, cores = 1)

Thanks, in advance. Charlotte

HajkD commented 4 years ago

Hi @gnmcsbnfrmtcsclb

Thank you very much for contacting me and please accept my apologies for the late reply.

I am very happy to learn that you could install LTRpred without problems and that it is useful for your research and analyses.

You might also find the newly created LTRpred Docker container useful for running LTRpred in a server or cluster environment.

Please find a detailed response to your questions below:

Can your LTRpred tool report all these 4 cases listed above, or only LTRs of type 1 and type 3 ?

Type 1. Full-length LTRs Type 2. (Internally) Truncated LTRs (both ends intact) Type 3. Solo LTRs (missing one of the ends) Type 4. Orphan LTRs (missing both ends, but with reecognizable internal features)

LTRpred can only report your specified Type 1 to Type 3 cases. For deriving Type 4 like elements, the best way would be to run a stringent BLASTN search using the DNA sequences of the typical gag, pol, etc loci found in many retrotransposons (between their 5' and 3' LTRs) against the reference genome that you wish to annotate. If you saw other ways of annotating these elements, I could have a look at these approaches and assess whether it would make sense to integrate this annotation feature into LTRpred.

Regarding the missing files from the documentation, this is no problem at all. I will go over the documentation to make sure that it is up to date.

After looking at your *.txt file, I see that the LTRpred run performed corretly.

However, tsv file created. Can this tsv output be used interchangably for downstream processing, instead of the expected csv? Or do I need to use your pred2csv function? F

Yes, this is just an old documentation issue. I changed the format from csv to tsv. It is exactly the same output. There is no need to use the pred2csv function.

For generating GFF file containing 4 different types of LTRs based on their degree of sequence completeness (see top of this post), has LTRpred already generated files that I can parse to glean this information using sed, awk, Perl etc. We are total newbies at R.

No it has not.

If not, I'd like to first start with generating solo-LTR annotation. My understanding is that for this, I would need to use ltr.cn, am I right? If not, what functions do I use? Where can I find this info?

Absolutely correct. Alternatively, you can specify copy.number.est = TRUE in the LTRpred::LTRpred(...) run which will run ltr.cn() internally and will store the output files in the tempdir() directory.

I hope this helps?

Best wishes, Hajk

gnmcsbnfrmtcsclb commented 4 years ago

Thanks a lot for your email and response here. We have a few follow-up questions / requests:

  1. Is there anywhere, may be library(LTRpred) vignettes (we are new to R) where for some small size sample input file, you show all the details - run syntax, runtime messages on-screen, and all the new files and folders created by the software, including filenames, file extensions so it's easy to compare our runs with a standard run to make sure all is OK?

  2. I think we understand your strategy to identify 'orphan LTRs'. THank you! But for the other 3 categories, i.e. full-length, internally-truncated, and Solo-LTRs, does LTRpred use specific terms/terminology to categorize them in the results, say in the GFF output file? If yes, what are those terminologies we can grep search for? If not, how can we post-process the LTRpred results, say the GFF file, to categorize them as we've indicated?

  3. If we already completed the LTRpred run once successfully, as you've confirmed, then I suppose we can choose to only run ltr.cn() rather than copy.number.est = TRUE for LTRpred::LTRpred(...), correct? And this can be either inside or outside Docker, correct? Are there pros and cons to running LTRpred inside vs. outside of Docker?

HajkD commented 4 years ago
  1. Did you have a chance to look at the documentation? https://hajkd.github.io/LTRpred/articles/Introduction.html#quick-start

  2. I would suggest to import the annotation data sheet returned by LTRpred as stated here: https://hajkd.github.io/LTRpred/articles/Introduction.html#import-ltrpred-output This will give you an overview what to grep for.

  3. No there are no pros or cons. It is the exact same tool. The Docker only helps people who are not able to install all tool dependencies.

gnmcsbnfrmtcsclb commented 4 years ago

Quick follow-up, so for reporting solo 3' and 5' LTRs, our understanding is that you in your previous post in this thread, you'd suggested we follow these steps:

If output results from our LTRpred run are found in filename - Run.tsv, then execute these steps:

library(LTRtype)
Run_Result <- read.ltrpred("Run.tsv")
dplyr::select(Run, ltr_similarity:end, tRNA_motif, Clust_cn) # just a check
Run_Result$cn_3ltr # should give vector of count numbers for that 3' LTR sequence that exists as solo element, across the target genome, right?
Run_Result$cn_5ltr # should give vector of count numbers for that 5' LTR sequence that exists as solo element, across the target genome, right?

Could you please confirm our steps and understanding is correct? Thank you!

In this Run.tsv results file from a successful LTRpred run as input, there are 1,994 rows (and 91 columns)

Howver, Run_Result$cn_5ltrand Run_Result$cn_3ltr both give 1,994 long array of NAs - this means none of the 5' or 3' LTR sequences reported by LTRtype is found as a solo element, which would be a minimum copy number of 1, correct?

Is the complete lack of solo LTRs even possible, because this is an almost 400MB plant genome, and so is expected to contain a lot of repeat elements, including LTRs, at least a few f which would be solo LTRs, right?

So I wonder if I ran LTRpred wrong (no STDOUT error though), or with wrong run or filter parameter(s) etc.?

Since an important goals is reporting solo LTR coords, how do you advice us to go about it, given our update in this post? Thanks, in advance.

HajkD commented 4 years ago

Hi,

Could you please specify the LTRpred::LTRpred(...) call that you were using? Otherwise it is hard for me to assess whether you used the correct parameter configuration.

In the LTRpred::LTRpred(...) function specification (https://hajkd.github.io/LTRpred/reference/LTRpred.html), did you set the argument copy.number.est = TRUE? If not (default is copy.number.est = FALSE) then no copy number estimation (including solo LTR prediction) will be performed and NAs are placed instead.

Is the complete lack of solo LTRs even possible, because this is an almost 400MB plant genome, and so is expected to contain a lot of repeat elements, including LTRs, at least a few f which would be solo LTRs, right?

Correct, there should definitely be solo LTRs.

I hope this helps.