Sequence Length - Githubissues

chrisjackson-pellicle / NewTargets

Files and scripts for generating customised, tailored target files for target-capture analyses.

GNU General Public License v3.0

7 stars 2 forks source link

Sequence Length #1

Closed Wwwangwh closed 1 year ago

Wwwangwh commented 1 year ago

Hello Prof， Thank you very much for developing this script, which provides us more possibilities for subsequent analysis. Here are some questions that I am confused about.

The enhanced sequence length shorter than the original probe length. I saw the following description in your article. In cases where a transcriptome hit sequence was shorter than the longest original target file sequence for a given locus, the transcriptome hit sequence was extended by grafting with the 5′ and/or 3′ termini of the closest related original sequence. My understanding of this statement is that the final sequences in BYO_target.fasta that obtained from other transcripts should be the same length as the original reference sequence（Ignore indel). But I see a lot of added sequences are shorter than reference (Especially at the beginning and end of the sequence). I don't know how this is caused, and if it has any bad effect on the subsequent analysis of hybpiper. The figure below shows one of the results.
ERROR: 'The following target file sequences cannot be translated without stop codons in any forwards' We did not consider the stop code when we design the probes, which led to some of our sequences do not contain a stop codon. I tried to annotate the judgment，it turns out that the program works fine and the result file looks good. Will this led to some problem? Here is the code that I annotated. """ if len(seqs_with_frameshifts_dict) != 0: logger.info(f'The following target file sequences cannot be translated without stop codons in any forwards ' f'frame, please correct this:') for key, value in seqs_with_frameshifts_dict.items(): logger.info(f'Gene {key}: {", ".join(value)}') sys.exit(f'Please check your target file and try again') """
Does the input file affect the final result? I test raw transcriptome data(transcriptome denovo assenble) and cds data(predicted by TransDecoder) as input data, the result shows that using cds data as input file can get about 50% more sequences than original transcriptome. I would like to know your choice of these datasets. Does cds as input have some bad effect?

Looking forward to you reply! Best wish

Wangwh

chrisjackson-pellicle commented 1 year ago

Hi @Wwwangwh - apologies for the delay; I just got back from 3 months of parental leave. Just letting you know I've seen this, and will address/answer it as soon as I can.

Cheers,

Chris

chrisjackson-pellicle commented 1 year ago

Hi @Wwwangwh,

Sorry for the delay in replying. In response to your questions:

1) Are you: a) using BYO_transcriptomes.py with a target file that contains only a single representative sequence for each gene? Or b) are there multiple representative sequences for a given gene, and your Geneious alignment is only showing one of them?

If b) is true, then your newly added transcriptome sequences can indeed be shorter than your longest original reference for a given gene, as `BYO_transcriptomes.py` grafts new sequences with the _closest_ related original reference. The new transcriptome sequence can only be grafted (and hence hopefully extended) to the length of the closest related original sequence - if this is short, it sets a maximum possible length for your new transcriptome sequence. 

If a) is true, and you're still seeing that your added transcriptome sequences are short compared to the single reference for a given gene, I'll need to look in to this further. If so, can you upload your target file, and also one of your transcriptomes if the file size isn't too large?

Having short target file sequences when running HybPiper _can_ reduce the length of your locus recovery for a given gene. If a short target file sequence is chosen as the 'best' reference for a gene, it is translated in to a protein sequence and used as a query in Exonerate searches against the assembled SPAdes contigs for a sample, to extract exonic coding regions for your sample. If the Exonerate query is short, it sets a maximum length on the sequence that can be extracted from the SPAdes contigs.

2) I don't fully understand what you mean here. Can you clarify this, please?

3) Interesting! I haven't benchmarked recovery with transcriptome vs TransDecoder predicted CDS before. I guess the effect could depend on how you run TransDecoder - if you're outputing the single longest ORF from each transcript, it could perhaps miss genes/ORFs of interest. But I would expect that to reduce your recovery, whereas you say using CDS sequences increases it by 50%.

I can't immediately see why using CDS sequences would cause this increased recovery - I would expect HMM searches (generated from alignments of your existing target file) to get the same hits from a longer transcript vs a shorter extracted CDS sequence. Are you able to upload a transcriptome and the corresponding TransDecoder CDS sequences, and I'll do some investigations?

Cheers,

Chris

Wwwangwh commented 1 year ago

Hello Prof， I'm sorry, I don't remember the exact steps I took at that time, so I redid some testing. A total of 4 transcriptomes were selected for the test. These data are downloaded from onekp(https://db.cngb.org/onekp/search/) . I searched and downloaded the transcriptome data of these four species Rhodiola rosea \ Hamamelis virginiana \ Cercidiphyllum japonicum \ Crassula perforata The target probe file contains 11 gene sequences that were selected from the genome of Kalanchoe. Here is the probe file. 11gene_test.zip

1.For gene CAM-1, three out of the four transcriptomes matched with homologous genes of this gene. It seems that the main issue lies at the beginning and end of the sequence.The following image shows the result of aligning these four sequences using MUSCLE. 123 The start and end of 2 of the 3 matched sequences are complete, but the start and end of KWGC have some missing parts.

2.This error has now magically disappeared. It's possible that we have processed the target probe file correctly since I can't quite remember which file was used at that time. (my original probe files were assembled by Trinity without any processing). Never mind about this question. It may be due to my mistake at that time

3.This is the command of TransDecoder I ran. basename -s .fa ./.fa |xargs -I {} -P 7 bash -c "TransDecoder.LongOrfs -t {}.fa " basename -s .fa ./.fa |xargs -I {} -P 7 bash -c "TransDecoder.Predict -t {}.fa "
The four *. fa.transdecoder.cds files generated here are used as the file input for BYO_transcriptomes.py. BYO parameters can be viewed in the attachment log file (-length_percentage 0.8). The 1.trans_org_result folder is the result of running the original transcriptome. The other is processed by TransDecoder(2.transdecodered_result). 3.BYO_result.zip According to the summary report in 18_reports, the unprocessed transcriptome ultimately had 30 sequences in the new target_file and showed a warning message "WARNING: A newly added sequence is longer than it should be for gene 2!". However, after going through TransDecoder processing, the final output contained 44 sequences and did not have any warnings about long sequences. I hope this was not due to my mistake. I have some transcriptome and probe files for the groups that I am focused on. If you are interested, I can perform some additional tests. However, since the data is not yet publicly available, I am unable to upload it online at this time. If there is anything that I haven't explained clearly, please let me know.

Cheers,

Wangwh

chrisjackson-pellicle commented 1 year ago

Hi @Wwwangwh,

Thanks for those files and info. I've downloaded your target file, as well as the four transcriptomes you list, and also the transcriptome for DRIL (Kalanchoe_crenato-diagremontiana) which is present in your TransDecoder vs unfiltered transcriptome comparison results.

1) Ah - did you run BYO_transcriptomes.py with the parameter -length_percentage 0.8 for this run too, as you did for the TransDecoder comparisons? In my run of the data using default settings (i.e. a value of 1.00 for the -length_percentage parameter), the termini of the sequence for KWGC-1 are full-length with respect to the CAM-1 reference, as expected. The un-grafted transcriptome hit sequence for KWGC-1 is 714 bases, whereas the CAM-1reference is 786 bases. Because the ratio of the length of the KWGC-1 sequence vs CAM-1 is above the -length_percentage 0.8 value you specified (714 / 786 = ~0.91), no grafting is performed for KWGC-1 in your runs.

2) No worries :)

3) I've had a look at your output files and run a few tests, and it looks like you're right! Processing the transcriptomes with TransDecoder and then running BYO_transcriptomes.py on the results does indeed recover more target sequences. More worryingly (and this is something I'll need to fix), it recovers 'better' sequences, in the sense that they match the query HMM profile better. I've had a look at several such cases, and it appears to be because:

  a) The `hmmsearch` searches carried out by `BYO_transcriptomes.py` only search the positive strand of each transcript.
  b) TransDecoder searches both strands when extracting what it thinks is the 'best' coding sequence.
  c) The TransDecoder sequences extracted from the negative strand of transcripts can match the HMM profiles using in `BYO_transcriptomes.py` better than any hit on the postive strands; this results in better hits with TransDecoder data, or hits missing entirely when using un-processed transcriptomes as input.

Thanks for uncovering this. When I get a chance, I'll release a new version of BYO_transcriptomes.py and add an optional parameter to run TransDecoder on the input transcriptomes first.

Cheers,

Chris

Wwwangwh commented 1 year ago

hi @chrisjackson-pellicle,

There's no problem now. This should be my misunderstanding of the parameter '-length_percentage'.

Cheers,

wangwh

chrisjackson-pellicle commented 1 year ago

Hi @Wwwangwh - I've just pushed a new version of the BYO_transcriptome.py script. Rather than running TransDecoder on the input transcriptomes prior to searching with HMMs, it now:

1) Uses the nhmmer command rather than the hmmsearch command (both from the package HMMER). nhmmer searches both forward and reverse strands of each sequence by default. 2) Parses the nhhmer output file to check whether the top hit (or top number of hits if the parameter -num_hits_to_recover is used) is on the reverse strand; if so, the transcriptome sequence is reverse-complemented prior to alignment and further processing.

This appears to be working well on my test dataset, and also when using your data - I'm getting more sequences with higher identity added to the final target file. It's also much faster than running TransDecoder, and avoids missing good hits when TransDecoder extracts the incorrect predicted CDS sequence from a given transcript.

There are a few other changes too - see the change_log.md here.

Let me know how it goes!

Cheers,

Chris

Wwwangwh commented 1 year ago

Hi @chrisjackson-pellicle， I successfully obtained a target file containing 1610 sequences without errors by running the updated script on my larger dataset, which consisted of 554 probe sequences and 5 transcriptomes. This has already achieved the results I wanted.

Just let you know.

Cheers,

Wangwh