Closed mathog closed 6 years ago
Hi @mathog , So you're saying that the cluster that contains the dlp gene (cluster 54527), in addition to correctly containing half the dlp gene, also contained a whole bunch of other genes?
That's not entirely surprisingly at this "preclustering" stage. You have so far only run run_preCluster.py
which is actually a pre-processing step of the attempt to find gene families. The preclustering step is extremely generous, basically grouping everything that has a minimap2 hit, and minimap2 can detect very low identity (~60% sometimes low) partial hits amongst sequences.
Please continue with the true family finding step as indicated in the tutorial which would be commands such as:
generate_batch_cmd_for_Cogent_family_finding.py --cpus=12 --cmd_filename=cmd preCluster.cluster_info.csv preCluster_out test_human
Once you have run family finding, please let me know how the dlp gene cluster looks. If it's still too relaxed, you can play with the parameters to increase k-mer percentage stringency.
On 25-Jun-2018 12:45, Elizabeth Tseng wrote:
Hi @mathog , So you're saying that the cluster that contains the dlp gene (cluster 54527), in addition to correctly containing half the dlp gene, also contained a whole bunch of other genes?
Yes
That's not entirely surprisingly at this "preclustering" stage. You have so far only run
run_preCluster.py
which is actually a pre-processing step of the attempt to find gene families. The preclustering step is extremely generous, basically grouping everything that has a minimap2 hit, and minimap2 can detect very low identity (~60% sometimes low) partial hits amongst sequences.Please continue with the true family finding step as indicated in the tutorial which would be commands such as:
generate_batch_cmd_for_Cogent_family_finding.py --cpus=12 --cmd_filename=cmd preCluster.cluster_info.csv preCluster_out test_human
Ran the equivalent. Then did
bash cmd
and it blew up after 20 directories. Just to see how this works the dlp directory was located (54527) and did:
cd preCluster_out/54527 nice run_mash.py -k 30 --cpus=40 isoseq_flnc.fasta >rmp.log 2>&1 &
process_kmer_to_graph.py isoseq_flnc.fasta isoseq_flnc.fasta.s1000k30.dist \ ../../test_sp 54527 > pktg.log 2>&1 &
ls -tal 161346067 Jun 25 17:07 54527.graphml 8301840 Jun 25 17:06 54527.partition.txt 236904627 Jun 22 13:18 isoseq_flnc.fasta 2068796284 Jun 25 16:54 isoseq_flnc.fasta.s1000k30.dist 9564315 Jun 25 17:06 pktg.log 3479564 Jun 25 16:54 rmp.log ls -1 ../../testsp | grep ^54527 | wc -l
The next step is supposed to be
generate_batch_cmd_for_Cogent_reconstruction.py
but putting 54527 in there fails in the top level (no such subdirectory) or in testsp (names are 54527*). It runs in preCluster_out without crashing but didn't make any files (or give any warnings). At what location is that command supposed to be run?
Thanks,
David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
Hi @mathog ,
The last parameter in
generate_batch_cmd_for_Cogent_family_finding.py --cpus=12
--cmd_filename=cmd preCluster.cluster_info.csv preCluster_out
test_human
is where the new output will be in. In this case, it's test_human/
and not preCluster_out
. (from the tutorial you can see I even said you could delete preCluster_out
after everything in the command file cmd
is completed).
This means, assuming everything went through, you can run
generate_batch_cmd_for_Cogent_reconstruction.py test_human/
If you are only interested in, say, cluster 54527 and don't care the other ones failed (you said it blew up after 20 directories, which I think should still be looked into), then as long as test_human/54527/
directory exists, you can run generate_batch_cmd_for_Cogent_reconstruction.py
and only run the reconstruction for test_human/54527/
.
It'd still be nice to have all the family finding finish. There's a bash script in the tutorial that let's you find out failed dirs:
#!/bin/bash
counter=0
for d in preCluster_out/* ; do
arr=(${d/\// })
FILE=$d/${arr[1]}".partition.txt";
if [ ! -f $FILE ]; then
echo "$d failed. Please re-run."
counter=$((counter+1))
fi
done
if [ $counter -eq 0 ]; then
echo "All bins completed! Proceed to reconstruction."
else
echo "Some bins failed. Please re-run them."
fi
I used this script to run the final reconstruct_contig.py stage:
#!/bin/bash
#Run a python script on each file.
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/mathog/src/Cogent/Complete-Striped-Smith-Waterman-Library/src
export PYTHONPATH=$PYTHONPATH:/home/mathog/src/Cogent/Complete-Striped-Smith-Waterman-Library/src
ATONCE=40
CMD=reconstruct_contig.py
for afile in `ls -1 test_sp`
do
NOW=`date`; echo "$NOW running rc on $afile"
$CMD -S urchin test_sp/$afile >test_sp/$afile/rc.log 2>&1 &
while [ `ps -ef | grep $CMD | wc -l` -gt $ATONCE ]
do
sleep 1
done
done
wait
NOW=`date`; echo "$NOW all jobs exited"
There were ~45000 subdirectories in test_sp, nearly all of them derived from 54527 (because the initial run blew up and didn't make many, and 54527 was run manually because it contained the test gene). It took 108 minutes to run. After that completed this was used to find the subdirectory for the dlp test case in the results and to characterize its contents:
grep -r TRINITYtrinLocDN33715c0g1t2 test_sp
fastasizes <test_sp/54527_175747/cogent2.renamed.fasta
1 7428 >181167|path0
2 7247 >181167|path1
3 16203 >181167|path5
4 7349 >181167|path8
5 443 >181167|path9
6 618 >181167|path10
7 9288 >181167|path11
8 530 >181167|path12
9 607 >181167|path13
10 401 >181167|path14
blastn -query /tmp/dystrophin_mRNA.fasta -subject test_sp/54527_175747/cogent2.renamed.fasta -outfmt 6
dlpRNA 181167|path11 99.871 9288 12 0 1 9288 9288 1 0.0 17086
dlpRNA 181167|path5 99.849 7966 12 0 284 8249 15339 7374 0.0 14645
dlpRNA 181167|path5 99.941 5109 3 0 8248 13356 7267 2159 0.0 9419
dlpRNA 181167|path8 100.000 4433 0 0 8924 13356 755 5187 0.0 8187
dlpRNA 181167|path1 100.000 4433 0 0 8924 13356 653 5085 0.0 8187
dlpRNA 181167|path0 100.000 4433 0 0 8924 13356 834 5266 0.0 8187
dlpRNA 181167|path13 100.000 585 0 0 2329 2913 23 607 0.0 1081
dlpRNA 181167|path10 99.655 580 1 1 12674 13253 40 618 0.0 1059
dlpRNA 181167|path12 100.000 508 0 0 11335 11842 508 1 0.0 939
dlpRNA 181167|path9 99.538 433 1 1 7314 7745 11 443 0.0 787
dlpRNA 181167|path14 99.459 370 2 0 7656 8025 1 370 0.0 673
#vs. selected entries from the initial input file found by Last
#select closest matches to above and edit format slightly (method shown further along in post)
lastal -P40 -f BlastTab /tmp/all_trin_10 /tmp/dystrophin_mRNA.fasta
dlpRNA TRINITYtrinLocDN44543c0g1t1 99.87 9288 12 0 9288 1 1 9288 0 1.46e+
dlpRNA TRINITYtrinLocDN33603c2g1t2 99.92 1307 10 0 284 13356 294 13366 0 2.06e+
-
dlpRNA TRINITYtrinLocDN39939c0g2d1373158t2 99.77 5138 12 0 13356 8219 2153 7290 0 8.09e+
dlpRNA TRINITYtrinLocDN29137c0g1d956656t6 100.00 4433 0 0 8924 13356 755 5187 0 7.01e+
dlpRNA TRINITYtrinLocDN21835c0g1d1110385t4 99.84 4433 6 1 8924 13356 606 5037 0 6.98e+
dlpRNA TRINITYtrinLocDN21835c0g1d1110387t2 99.84 4433 6 1 8924 13356 835 5266 0 6.98e+
dlpRNA TRINITYtrinLocDN24025c2g4t1 100.00 585 0 0 2329 2913 23 607 8e-266 927
dlpRNA TRINITYtrinLocDN30718c0g2d488845t2 99.66 580 1 1 12674 13253 40 618 3.3e-2 901
dlpRNA TRINITYtrinLocDN52500c1g3t2 100.00 508 0 0 11842 11335 1 508 3.6e-2 805
dlpRNA TRINITYtrinLocDN28881c2g3t1 99.53 427 2 0 7319 7745 17 443 1.1e-1 670
dlpRNA TRINITYtrinLocDN52500c1g2t3 99.46 370 2 0 7656 8025 1 370 1.4e-1 580
So, essentially what Cogent did was to collapse classes of similar Trinity transcripts into single (or at least fewer) instances. Functionally that is pretty close to what EvidentialGene does - either way the result is a list of transcripts with more or less one per transcript class. What neither did though was to merge between the families. Path5 and path11 have an essentially identical region of 8kbp where they overlap. In this test case we have reason to believe that path11 is a reasonably accurate representation of the 5' end of (at least some) transcripts: it has what looks like a complete 5' end of the coding sequence for a protein matching (as well as can be expected) that in other organisms. The 5' end of path5 is missing the start of that enormous ORF and the first start in its large ORF is preceded by many starts, so it likely wouldn't code for anything.
It is not clear at this point if the path5 transcripts indicate Trinity assembly errors or if they represent a real type of defective transcript. Brian Haas has been making good progress in getting Trinity to work better with polymorphic, non-single direction reads (what we have), and if that class stops appearing I won't be greatly surprised. dlp is a very hard case - the gene is ~400kbp long and the most recent Trinity (development branch) runs produced a few reasonable looking (full CDS, all transcript sequence mapping in order onto genomic sequence within the gene) transcripts over 15kbp in size. These are essentially the same as path5 from 865bp to the 3' end, but the 5' ends differ.
Then there are the ~45k other test_sp subdirectories also derived from 54527 which presumably contain results for other genes, but perhaps also some dlp? (Commands below uses my extract and execinput programs, they are in drm_tools on sourceforge, or use perl, awk, etc.)
#give all entries names corresponding to their subdirectory and merge into one file
ls -1 test_sp/*/cogent2.renamed.fasta \
| extract -wl 1000000 -mt -dl '/' -fmt 'extract -wl 1000000 -in [1,] -if '\''>'\'' -fmt '\''>[2]|[[2,]]'\'' >> /tmp/all_cogent.fasta' \
| execinput
cd /tmp
grep '>' all_cogent.fasta | head -5
>25005_0|edc956|path0
>25005_3|76e2bc|path0
>25005_7|636540|path0
>25005_8|7f037b|path0
>25706_0|18d1b4|path0
#make a Last database and search it with the reference sequence, remove expected directory
lastdb -c -w 10 all_cogent_10 all_cogent.fasta
lastal -P40 -f BlastTab all_cogent_10 dystrophin_mRNA.fasta \
| grep -v ^# \
| grep -v 54527_175747 \
| tr '\t' ' ' | extract -jr -mt -fmt 'dlpRNA [jl:fw35:2] [fw6:3] [fw4:4] [fw3:5,6] [fw5:7,10] [fw6:11,]'
dlpRNA 54527_9024|a2cdf7|path0 99.42 686 4 0 285 970 38 723 1.3e-3 1.07e+
dlpRNA 54527_177779|127768|path0 99.81 517 1 0 1486 970 1 517 6.5e-2 816
dlpRNA 54527_184272|19ae8c|path0 100.00 502 0 0 3824 4325 1 502 1e-227 795
dlpRNA 54527_107713|77fa31|path0 99.60 497 2 0 5843 6339 2 498 1.9e-2 781
dlpRNA 54527_183672|4e9f85|path0 100.00 422 0 0 5081 4660 1 422 1.2e-1 669
dlpRNA 54527_64597|a1c890|path0 98.71 389 5 0 7156 6768 1 389 3.6e-1 601
dlpRNA 54527_9024|a2cdf7|path2 100.00 288 0 0 384 97 1 288 7.6e-1 457
dlpRNA 54527_64597|a1c890|path1 99.25 265 2 0 7032 6768 38 302 5.4e-1 414
dlpRNA 54527_209310|bb2072|path0 99.24 263 2 0 1712 1450 1 263 4.8e-1 411
dlpRNA 54527_9024|a2cdf7|path1 100.00 189 0 0 285 97 46 234 1e-78 300
dlpRNA 54527_90208|33f342|path0 64.39 278 99 0 10563 10840 2 279 7.9e-2 128
dlpRNA 54527_90208|33f342|path1 65.22 207 72 0 10634 10840 3 209 9.8e-1 101
dlpRNA 54527_81872|9dd5e8|path1 61.48 244 94 0 961 718 13996 14239 2.1e-1 90.1
dlpRNA 54527_94789|4c7117|path0 100.00 40 0 0 8924 8963 706 745 8.7e-0 64.8
It looks like bits of dlp ended up in 10 other directories. The only known repeat in dlpRNA appears once in the transcript at about 7440bp and corresponds to a short genomic direct repeat, and that position isn't in any of these matches.
Hi @mathog ,
What neither did though was to merge between the families
What do you refer to by "families" here? Do you mean merge transcripts/isoforms of the same gene (same locus, different splice forms) or different genes (different loci)?
The intention of Cogent was not to merge different genes (unless it can't help it because the two genes are way too similar).
I'm trying to understand the relationship between path5 and path11. Do I understand correctly that:
path11 matches completely to dlp gene 1-9288 bp. But is on the reverse strand? path5 looks like it matches to dlp, where 2159-15339 matches to 284-13356bp of the dlp gene, with a potentially little blip in the middle. (whether that blip is a gap in the Trinity assembly, currently unclear as you have stated)
I should mention Cogent is strand-aware (because it's used for Iso-Seq data which is strand aware). It looks to me that the input transcripts may be flipped? Not that it matters much, but you may want to flip them to the transcript sense strand.
I'm wondering out loud why path5 and path11 are not merged by Cogent, assembling effectively 1-13356bp of dlp gene (the whole gene). It seems to be it should be based on the mapped coordinates to dlp, unless there's some sequence difference between them that I can't tell based on this report.
Can you show an alignment between path5 and path11?
Families refers to all the sequences which have been clustered together for a given gene's transcripts. As far as we know there is just one dlp in these genomes and nothing very similar to it, so all the paths should be for that and not some mixture with other genes. The gene is huge and there are a lot of exons so there are almost certainly transcript variants mixed in with the Trinity assembly variations. (As in, Trinity results differ between software versions, whereas the biological variation in a given test library is fixed.)
Alignment of path5 and path11:
The overlap of path5 and path11 is identical for the first ~690bp, where there is a SNP, and then identical again until 1039, after which there is a small indel which is visible in the alignment above.
Selectively reverse complementing the input sequences before putting them into cogent isn't an option. For dlp we have a reference sequence, but in general one may not be known, or we may not trust the one we have very much. The zip files below contain the reference sequence shown as "dlpRNA" in the blastn results of previous posts, and the full set of paths including 5 and 11.
The dystrophin like protein (dlp) is being employed as a hard test case while building transcriptomes for several different sea urchin species. The full length dlp mRNA is more than 13kbp and the protein more than 4kaa. 22 RNAseq libraries were built using Trinity 2.6.6 and concatenated (with minor header renaming to avoid conflicts) into all_trin.fasta.
This large fasta file is the input for Cogent. It was searched for dlp pieces with Last like so:
which found 275 hits. The input sequences were clustered with:
There were no errors in the log file.
The largest dlp transcript in the Trinity results was TRINITYtrinLocDN33715c0g1t2 which was found in cluster 54527.
found 121 hits. So about half the hits that Last found are in this cluster. The missing hits are not the problem, the problem is all the other stuff in there:
In other words 7.3% of all sequences and 8.1% of all sequence landed in this cluster.
The Last results and cluster results are not consistent.
A plot of (genomic) tuple counts along the entirety of NCBI's dlp transcript does not show any repetitive sequences, which would have been prone to recruiting other sequences.
What explains this result?