Magdoll / Cogent

Coding Genome Reconstruction using Iso-Seq data
BSD 3-Clause Clear License
60 stars 17 forks source link

Excessive clustering observed for one gene #39

Closed mathog closed 6 years ago

mathog commented 6 years ago

The dystrophin like protein (dlp) is being employed as a hard test case while building transcriptomes for several different sea urchin species. The full length dlp mRNA is more than 13kbp and the protein more than 4kaa. 22 RNAseq libraries were built using Trinity 2.6.6 and concatenated (with minor header renaming to avoid conflicts) into all_trin.fasta.

This large fasta file is the input for Cogent. It was searched for dlp pieces with Last like so:

lastdb -c -w 10 all_trin_10 all_trin.fasta
lastal -P40 -f BlastTab all_trin_10 dystrophin_mRNA.fasta | grep -v ^# | wc

which found 275 hits. The input sequences were clustered with:

ln -s all_trin.fasta isoseq_flnc.fasta
nice run_preCluster.py --cpus=40 >rp.log 2>&1 &

There were no errors in the log file.

The largest dlp transcript in the Trinity results was TRINITYtrinLocDN33715c0g1t2 which was found in cluster 54527.

cp preCluster_out/54527/isoseq_flnc.fasta /tmp
(cd /tmp; lastdb -c -w 10 isoseq_flnc_10 isoseq_flnc.fasta)
lastal -P40 -f BlastTab /tmp/isoseq_flnc_10 dystrophin_mRNA.fasta | grep -v ^# | wc

found 121 hits. So about half the hits that Last found are in this cluster. The missing hits are not the problem, the problem is all the other stuff in there:

A all_trin.fasta
B preCluster_out/54527/isoseq_flnc.fasta

what     seqs         bp min   max
A     3319071 2801519264 201 36007
B      243068  228752161 201 36007

In other words 7.3% of all sequences and 8.1% of all sequence landed in this cluster.

The Last results and cluster results are not consistent.

A plot of (genomic) tuple counts along the entirety of NCBI's dlp transcript does not show any repetitive sequences, which would have been prone to recruiting other sequences.

What explains this result?

Magdoll commented 6 years ago

Hi @mathog , So you're saying that the cluster that contains the dlp gene (cluster 54527), in addition to correctly containing half the dlp gene, also contained a whole bunch of other genes?

That's not entirely surprisingly at this "preclustering" stage. You have so far only run run_preCluster.py which is actually a pre-processing step of the attempt to find gene families. The preclustering step is extremely generous, basically grouping everything that has a minimap2 hit, and minimap2 can detect very low identity (~60% sometimes low) partial hits amongst sequences.

Please continue with the true family finding step as indicated in the tutorial which would be commands such as:

generate_batch_cmd_for_Cogent_family_finding.py --cpus=12 --cmd_filename=cmd preCluster.cluster_info.csv preCluster_out test_human

Once you have run family finding, please let me know how the dlp gene cluster looks. If it's still too relaxed, you can play with the parameters to increase k-mer percentage stringency.

mathog commented 6 years ago

On 25-Jun-2018 12:45, Elizabeth Tseng wrote:

Hi @mathog , So you're saying that the cluster that contains the dlp gene (cluster 54527), in addition to correctly containing half the dlp gene, also contained a whole bunch of other genes?

Yes

That's not entirely surprisingly at this "preclustering" stage. You have so far only run run_preCluster.py which is actually a pre-processing step of the attempt to find gene families. The preclustering step is extremely generous, basically grouping everything that has a minimap2 hit, and minimap2 can detect very low identity (~60% sometimes low) partial hits amongst sequences.

Please continue with the true family finding step as indicated in the tutorial which would be commands such as:

generate_batch_cmd_for_Cogent_family_finding.py --cpus=12
--cmd_filename=cmd preCluster.cluster_info.csv preCluster_out
test_human

Ran the equivalent. Then did

bash cmd

and it blew up after 20 directories. Just to see how this works the dlp directory was located (54527) and did:

cd preCluster_out/54527 nice run_mash.py -k 30 --cpus=40 isoseq_flnc.fasta >rmp.log 2>&1 &

13:03 - 16:54.

process_kmer_to_graph.py isoseq_flnc.fasta isoseq_flnc.fasta.s1000k30.dist \ ../../test_sp 54527 > pktg.log 2>&1 &

about 7 min

ls -tal 161346067 Jun 25 17:07 54527.graphml 8301840 Jun 25 17:06 54527.partition.txt 236904627 Jun 22 13:18 isoseq_flnc.fasta 2068796284 Jun 25 16:54 isoseq_flnc.fasta.s1000k30.dist 9564315 Jun 25 17:06 pktg.log 3479564 Jun 25 16:54 rmp.log ls -1 ../../testsp | grep ^54527 | wc -l

43335

The next step is supposed to be

generate_batch_cmd_for_Cogent_reconstruction.py

but putting 54527 in there fails in the top level (no such subdirectory) or in testsp (names are 54527*). It runs in preCluster_out without crashing but didn't make any files (or give any warnings). At what location is that command supposed to be run?

Thanks,

David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech

Magdoll commented 6 years ago

Hi @mathog ,

The last parameter in

 generate_batch_cmd_for_Cogent_family_finding.py --cpus=12
 --cmd_filename=cmd preCluster.cluster_info.csv preCluster_out
 test_human

is where the new output will be in. In this case, it's test_human/ and not preCluster_out. (from the tutorial you can see I even said you could delete preCluster_out after everything in the command file cmd is completed).

This means, assuming everything went through, you can run

generate_batch_cmd_for_Cogent_reconstruction.py test_human/

If you are only interested in, say, cluster 54527 and don't care the other ones failed (you said it blew up after 20 directories, which I think should still be looked into), then as long as test_human/54527/ directory exists, you can run generate_batch_cmd_for_Cogent_reconstruction.py and only run the reconstruction for test_human/54527/.

It'd still be nice to have all the family finding finish. There's a bash script in the tutorial that let's you find out failed dirs:

#!/bin/bash

counter=0
for d in preCluster_out/* ; do
    arr=(${d/\// })
    FILE=$d/${arr[1]}".partition.txt";
    if [ ! -f $FILE ]; then
        echo "$d failed. Please re-run."
        counter=$((counter+1))
    fi
done

if [ $counter -eq 0 ]; then
    echo "All bins completed! Proceed to reconstruction."
else
    echo "Some bins failed. Please re-run them."
fi
mathog commented 6 years ago

I used this script to run the final reconstruct_contig.py stage:

#!/bin/bash
#Run a python script on each file.
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/mathog/src/Cogent/Complete-Striped-Smith-Waterman-Library/src
export PYTHONPATH=$PYTHONPATH:/home/mathog/src/Cogent/Complete-Striped-Smith-Waterman-Library/src
ATONCE=40
CMD=reconstruct_contig.py
for afile in `ls -1 test_sp`
do
   NOW=`date`; echo "$NOW running rc on $afile"
   $CMD -S urchin test_sp/$afile >test_sp/$afile/rc.log 2>&1 &
   while [ `ps -ef | grep $CMD | wc -l` -gt $ATONCE ]
   do
     sleep 1
   done
done
wait
NOW=`date`; echo "$NOW all jobs exited"

There were ~45000 subdirectories in test_sp, nearly all of them derived from 54527 (because the initial run blew up and didn't make many, and 54527 was run manually because it contained the test gene). It took 108 minutes to run. After that completed this was used to find the subdirectory for the dlp test case in the results and to characterize its contents:

grep -r TRINITYtrinLocDN33715c0g1t2 test_sp
fastasizes <test_sp/54527_175747/cogent2.renamed.fasta 
        1      7428 >181167|path0
        2      7247 >181167|path1
        3     16203 >181167|path5
        4      7349 >181167|path8
        5       443 >181167|path9
        6       618 >181167|path10
        7      9288 >181167|path11
        8       530 >181167|path12
        9       607 >181167|path13
       10       401 >181167|path14
blastn -query /tmp/dystrophin_mRNA.fasta -subject test_sp/54527_175747/cogent2.renamed.fasta  -outfmt 6
dlpRNA 181167|path11    99.871 9288 12 0     1  9288  9288    1 0.0 17086
dlpRNA 181167|path5     99.849 7966 12 0   284  8249 15339 7374 0.0 14645
dlpRNA 181167|path5     99.941 5109  3 0  8248 13356  7267 2159 0.0 9419
dlpRNA 181167|path8    100.000 4433  0 0  8924 13356   755 5187 0.0 8187
dlpRNA 181167|path1    100.000 4433  0 0  8924 13356   653 5085 0.0 8187
dlpRNA 181167|path0    100.000 4433  0 0  8924 13356   834 5266 0.0 8187
dlpRNA 181167|path13   100.000  585  0 0  2329  2913    23  607 0.0 1081
dlpRNA 181167|path10    99.655  580  1 1 12674 13253    40  618 0.0 1059
dlpRNA 181167|path12   100.000  508  0 0 11335 11842   508    1 0.0 939
dlpRNA 181167|path9     99.538  433  1 1  7314  7745    11  443 0.0 787
dlpRNA 181167|path14    99.459  370  2 0  7656  8025     1  370 0.0 673

#vs. selected entries from the initial input file found by Last
#select closest matches to above and edit format slightly (method shown further along in post)
lastal -P40 -f BlastTab /tmp/all_trin_10 /tmp/dystrophin_mRNA.fasta 
dlpRNA TRINITYtrinLocDN44543c0g1t1          99.87 9288  12   0  9288     1     1  9288      0 1.46e+
dlpRNA TRINITYtrinLocDN33603c2g1t2          99.92 1307  10   0   284 13356   294 13366      0 2.06e+
-
dlpRNA TRINITYtrinLocDN39939c0g2d1373158t2  99.77 5138  12   0 13356  8219  2153  7290      0 8.09e+
dlpRNA TRINITYtrinLocDN29137c0g1d956656t6  100.00 4433   0   0  8924 13356   755  5187      0 7.01e+
dlpRNA TRINITYtrinLocDN21835c0g1d1110385t4  99.84 4433   6   1  8924 13356   606  5037      0 6.98e+
dlpRNA TRINITYtrinLocDN21835c0g1d1110387t2  99.84 4433   6   1  8924 13356   835  5266      0 6.98e+
dlpRNA TRINITYtrinLocDN24025c2g4t1         100.00  585   0   0  2329  2913    23   607 8e-266    927
dlpRNA TRINITYtrinLocDN30718c0g2d488845t2   99.66  580   1   1 12674 13253    40   618 3.3e-2    901
dlpRNA TRINITYtrinLocDN52500c1g3t2         100.00  508   0   0 11842 11335     1   508 3.6e-2    805
dlpRNA TRINITYtrinLocDN28881c2g3t1          99.53  427   2   0  7319  7745    17   443 1.1e-1    670
dlpRNA TRINITYtrinLocDN52500c1g2t3          99.46  370   2   0  7656  8025     1   370 1.4e-1    580

So, essentially what Cogent did was to collapse classes of similar Trinity transcripts into single (or at least fewer) instances. Functionally that is pretty close to what EvidentialGene does - either way the result is a list of transcripts with more or less one per transcript class. What neither did though was to merge between the families. Path5 and path11 have an essentially identical region of 8kbp where they overlap. In this test case we have reason to believe that path11 is a reasonably accurate representation of the 5' end of (at least some) transcripts: it has what looks like a complete 5' end of the coding sequence for a protein matching (as well as can be expected) that in other organisms. The 5' end of path5 is missing the start of that enormous ORF and the first start in its large ORF is preceded by many starts, so it likely wouldn't code for anything.

It is not clear at this point if the path5 transcripts indicate Trinity assembly errors or if they represent a real type of defective transcript. Brian Haas has been making good progress in getting Trinity to work better with polymorphic, non-single direction reads (what we have), and if that class stops appearing I won't be greatly surprised. dlp is a very hard case - the gene is ~400kbp long and the most recent Trinity (development branch) runs produced a few reasonable looking (full CDS, all transcript sequence mapping in order onto genomic sequence within the gene) transcripts over 15kbp in size. These are essentially the same as path5 from 865bp to the 3' end, but the 5' ends differ.

Then there are the ~45k other test_sp subdirectories also derived from 54527 which presumably contain results for other genes, but perhaps also some dlp? (Commands below uses my extract and execinput programs, they are in drm_tools on sourceforge, or use perl, awk, etc.)

#give all entries names corresponding to their subdirectory and merge into one file
ls -1 test_sp/*/cogent2.renamed.fasta \
 | extract -wl 1000000 -mt -dl '/' -fmt 'extract  -wl 1000000 -in [1,] -if '\''>'\'' -fmt '\''>[2]|[[2,]]'\'' >> /tmp/all_cogent.fasta' \
 | execinput
cd /tmp
grep '>' all_cogent.fasta | head -5
>25005_0|edc956|path0
>25005_3|76e2bc|path0
>25005_7|636540|path0
>25005_8|7f037b|path0
>25706_0|18d1b4|path0
#make a Last database and search it with the reference sequence, remove expected directory
lastdb -c -w 10 all_cogent_10 all_cogent.fasta
lastal -P40 -f BlastTab all_cogent_10 dystrophin_mRNA.fasta \
  | grep -v ^# \
  | grep -v 54527_175747 \
  | tr '\t' ' ' | extract -jr -mt -fmt 'dlpRNA [jl:fw35:2] [fw6:3] [fw4:4] [fw3:5,6] [fw5:7,10] [fw6:11,]'
dlpRNA 54527_9024|a2cdf7|path0              99.42  686   4   0   285   970    38   723 1.3e-3 1.07e+
dlpRNA 54527_177779|127768|path0            99.81  517   1   0  1486   970     1   517 6.5e-2    816
dlpRNA 54527_184272|19ae8c|path0           100.00  502   0   0  3824  4325     1   502 1e-227    795
dlpRNA 54527_107713|77fa31|path0            99.60  497   2   0  5843  6339     2   498 1.9e-2    781
dlpRNA 54527_183672|4e9f85|path0           100.00  422   0   0  5081  4660     1   422 1.2e-1    669
dlpRNA 54527_64597|a1c890|path0             98.71  389   5   0  7156  6768     1   389 3.6e-1    601
dlpRNA 54527_9024|a2cdf7|path2             100.00  288   0   0   384    97     1   288 7.6e-1    457
dlpRNA 54527_64597|a1c890|path1             99.25  265   2   0  7032  6768    38   302 5.4e-1    414
dlpRNA 54527_209310|bb2072|path0            99.24  263   2   0  1712  1450     1   263 4.8e-1    411
dlpRNA 54527_9024|a2cdf7|path1             100.00  189   0   0   285    97    46   234  1e-78    300
dlpRNA 54527_90208|33f342|path0             64.39  278  99   0 10563 10840     2   279 7.9e-2    128
dlpRNA 54527_90208|33f342|path1             65.22  207  72   0 10634 10840     3   209 9.8e-1    101
dlpRNA 54527_81872|9dd5e8|path1             61.48  244  94   0   961   718 13996 14239 2.1e-1   90.1
dlpRNA 54527_94789|4c7117|path0            100.00   40   0   0  8924  8963   706   745 8.7e-0   64.8

It looks like bits of dlp ended up in 10 other directories. The only known repeat in dlpRNA appears once in the transcript at about 7440bp and corresponds to a short genomic direct repeat, and that position isn't in any of these matches.

Magdoll commented 6 years ago

Hi @mathog ,

What neither did though was to merge between the families

What do you refer to by "families" here? Do you mean merge transcripts/isoforms of the same gene (same locus, different splice forms) or different genes (different loci)?

The intention of Cogent was not to merge different genes (unless it can't help it because the two genes are way too similar).

I'm trying to understand the relationship between path5 and path11. Do I understand correctly that:

path11 matches completely to dlp gene 1-9288 bp. But is on the reverse strand? path5 looks like it matches to dlp, where 2159-15339 matches to 284-13356bp of the dlp gene, with a potentially little blip in the middle. (whether that blip is a gap in the Trinity assembly, currently unclear as you have stated)

I should mention Cogent is strand-aware (because it's used for Iso-Seq data which is strand aware). It looks to me that the input transcripts may be flipped? Not that it matters much, but you may want to flip them to the transcript sense strand.

I'm wondering out loud why path5 and path11 are not merged by Cogent, assembling effectively 1-13356bp of dlp gene (the whole gene). It seems to be it should be based on the mapped coordinates to dlp, unless there's some sequence difference between them that I can't tell based on this report.

Can you show an alignment between path5 and path11?

mathog commented 6 years ago

Families refers to all the sequences which have been clustered together for a given gene's transcripts. As far as we know there is just one dlp in these genomes and nothing very similar to it, so all the paths should be for that and not some mixture with other genes. The gene is huge and there are a lot of exons so there are almost certainly transcript variants mixed in with the Trinity assembly variations. (As in, Trinity results differ between software versions, whereas the biological variation in a given test library is fixed.)

Alignment of path5 and path11: cogent_path11_vs_path5

The overlap of path5 and path11 is identical for the first ~690bp, where there is a SNP, and then identical again until 1039, after which there is a small indel which is visible in the alignment above.

Selectively reverse complementing the input sequences before putting them into cogent isn't an option. For dlp we have a reference sequence, but in general one may not be known, or we may not trust the one we have very much. The zip files below contain the reference sequence shown as "dlpRNA" in the blastn results of previous posts, and the full set of paths including 5 and 11.

dystrophin_mRNA.zip

cogent_54527_175747_cogent2.renamed.fasta.zip